
UnicodeSet Class Reference

A mutable set of Unicode characters.

#include <uniset.h>

Inheritance diagram for UnicodeSet:

UnicodeFilter UnicodeFunctor UnicodeMatcher

Public Methods

 UnicodeSet ()
 Constructs an empty set.

 UnicodeSet (UChar32 start, UChar32 end)
 Constructs a set containing the given range.

 UnicodeSet (const UnicodeString &pattern, UErrorCode &status)
 Constructs a set from the given pattern.

 UnicodeSet (int8_t category, UErrorCode &status)
 DEPRECATED Constructs a set from the given Unicode character category.

 UnicodeSet (const UnicodeSet &o)
 Constructs a set that is identical to the given UnicodeSet.

virtual ~UnicodeSet ()
 Destructs the set.

UnicodeSet & operator= (const UnicodeSet &o)
 Assigns this object to be a copy of another.

virtual UBool operator== (const UnicodeSet &o) const
 Compares the specified object with this set for equality.

UBool operator!= (const UnicodeSet &o) const
 Compares the specified object with this set for inequality.

virtual UnicodeFunctor * clone () const
 Returns a copy of this object.

virtual int32_t hashCode (void) const
 Returns the hash code value for this set.

void set (UChar32 start, UChar32 end)
 Make this object represent the range start - end.

virtual void applyPattern (const UnicodeString &pattern, UErrorCode &status)
 Modifies this set to represent the set specified by the given pattern, optionally ignoring white space.

virtual UnicodeString & toPattern (UnicodeString &result, UBool escapeUnprintable=FALSE) const
 Returns a string representation of this set.

virtual int32_t size (void) const
 Returns the number of elements in this set (its cardinality), n, where 0 <= n <= 0x110000.

virtual UBool isEmpty (void) const
 Returns true if this set contains no elements.

virtual UBool contains (UChar32 start, UChar32 end) const
 Returns true if this set contains the specified range of chars.

virtual UBool contains (UChar32 c) const
 Returns true if this set contains the specified char.

UMatchDegree matches (const Replaceable &text, int32_t &offset, int32_t limit, UBool incremental)
 Implement UnicodeMatcher::matches().

int32_t indexOf (UChar32 c) const
 Returns the index of the given character within this set, where the set is ordered by ascending code point.

UChar32 charAt (int32_t index) const
 Returns the character at the given index within this set, where the set is ordered by ascending code point.

virtual void add (UChar32 start, UChar32 end)
 Adds the specified range to this set if it is not already present.

void add (UChar32 c)
 Adds the specified character to this set if it is not already present.

virtual void retain (UChar32 start, UChar32 end)
 Retain only the elements in this set that are contained in the specified range.

void retain (UChar32 c)
 Retains only the specified character in this set, if it is present.

virtual void remove (UChar32 start, UChar32 end)
 Removes the specified range from this set if it is present.

void remove (UChar32 c)
 Removes the specified character from this set if it is present.

virtual void complement (void)
 Inverts this set.

virtual void complement (UChar32 start, UChar32 end)
 Complements the specified range in this set.

void complement (UChar32 c)
 Complements the specified character in this set.

virtual UBool containsAll (const UnicodeSet &c) const
 Returns true if the specified set is a subset of this set.

virtual void addAll (const UnicodeSet &c)
 Adds all of the elements in the specified set to this set if they're not already present.

virtual void retainAll (const UnicodeSet &c)
 Retains only the elements in this set that are contained in the specified set.

virtual void removeAll (const UnicodeSet &c)
 Removes from this set all of its elements that are contained in the specified set.

virtual void complementAll (const UnicodeSet &c)
 Complements in this set all elements contained in the specified set.

virtual void clear (void)
 Removes all of the elements from this set.

virtual int32_t getRangeCount (void) const
 Iteration method that returns the number of ranges contained in this set.

virtual UChar32 getRangeStart (int32_t index) const
 Iteration method that returns the first character in the specified range of this set.

virtual UChar32 getRangeEnd (int32_t index) const
 Iteration method that returns the last character in the specified range of this set.

virtual void compact ()
 Reallocate this object's internal structures to take up the least possible space, without changing this object's value.

virtual UClassID getDynamicClassID (void) const
 Implement UnicodeFunctor API.


Static Public Methods

UBool resemblesPattern (const UnicodeString &pattern, int32_t pos)
 Return true if the given position, in the given pattern, appears to be the start of a UnicodeSet pattern.

UClassID getStaticClassID (void)
 Return the class ID for this class.


Static Public Attributes

const UChar32 MIN_VALUE
 Minimum value that can be stored in a UnicodeSet.

const UChar32 MAX_VALUE
 Maximum value that can be stored in a UnicodeSet.


Friends

class NormalizationTransliterator
class Transliterator
class TransliteratorParser
class TransliteratorIDParser
class TransliterationRule

Detailed Description

A mutable set of Unicode characters.

Objects of this class represent character classes used in regular expressions. A character class specifies a subset of Unicode code points. Legal code points are U+0000 to U+10FFFF, inclusive.

UnicodeSet supports two APIs. The first is the operand API that allows the caller to modify the value of a UnicodeSet object. It conforms to Java 2's java.util.Set interface, although UnicodeSet does not actually implement that interface. All methods of Set are supported, with the modification that they take a character range or single character instead of an Object, and they take a UnicodeSet instead of a Collection. The operand API may be thought of in terms of boolean logic: a boolean OR is implemented by add, a boolean AND is implemented by retain, a boolean XOR is implemented by complement taking an argument, and a boolean NOT is implemented by complement with no argument. In terms of traditional set theory function names, add is a union, retain is an intersection, remove is an asymmetric difference, and complement with no argument is a set complement with respect to the superset range MIN_VALUE to MAX_VALUE.
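The boolean-logic reading of the operand API can be sketched with a plain bit vector over the code point range. This is only an illustrative model and says nothing about how UnicodeSet is implemented: ToySet and the helpers rangeMask, addRange, retainRange, removeRange, complementRange, and complementAll are hypothetical names standing in for add, retain, remove, and the two forms of complement.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical model: one bit per code point in U+0000..U+10FFFF.
constexpr std::size_t kCodePoints = 0x110000;
using ToySet = std::bitset<kCodePoints>;

// Mask with the bits start..end set (inclusive); empty when start > end.
inline ToySet rangeMask(char32_t start, char32_t end) {
    ToySet m;
    for (char32_t c = start; c <= end && c < kCodePoints; ++c) m.set(c);
    return m;
}

inline void addRange(ToySet& s, char32_t a, char32_t b)        { s |= rangeMask(a, b); }   // OR, union
inline void retainRange(ToySet& s, char32_t a, char32_t b)     { s &= rangeMask(a, b); }   // AND, intersection
inline void removeRange(ToySet& s, char32_t a, char32_t b)     { s &= ~rangeMask(a, b); }  // asymmetric difference
inline void complementRange(ToySet& s, char32_t a, char32_t b) { s ^= rangeMask(a, b); }   // XOR
inline void complementAll(ToySet& s)                           { s.flip(); }               // NOT
```

Starting from an empty ToySet, adding 'a'..'f' and then complementing 'd'..'h' leaves 'a', 'b', 'c', 'g', and 'h': the overlap is removed and the rest of the range is added, which is exactly the XOR behavior described above.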

The second API is the applyPattern()/toPattern() API from the java.text.Format-derived classes. Unlike the methods that add characters, add categories, and control the logic of the set, the method applyPattern() sets all attributes of a UnicodeSet at once, based on a string pattern.

Pattern syntax

Patterns are accepted by the constructors and the applyPattern() methods and returned by the toPattern() method. These patterns follow a syntax similar to that employed by version 8 regular expression character classes:

pattern :=       ('[' '^'? item* ']') | property
item :=          char | (char '-' char) | pattern-expr
pattern-expr :=  pattern | pattern-expr pattern | pattern-expr op pattern
op :=            '&' | '-'
special :=       '[' | ']' | '-'
char :=          any character that is not special
                 | ('\u005C' any character)
                 | ('\u005Cu' hex hex hex hex)
hex :=           any character for which Character.digit(c, 16) returns a non-negative result
property :=      a Unicode property set pattern

Legend:
a := b   a may be replaced by b
a? zero or one instance of a
a* zero or more instances of a
a | b either a or b
'a' the literal string between the quotes

Any character may be preceded by a backslash in order to remove any special meaning. White space characters, as defined by UCharacter.isWhitespace(), are ignored, unless they are escaped.

Property patterns specify a set of characters having a certain property as defined by the Unicode standard. Both the POSIX-like "[:Lu:]" and the Perl-like syntax "\p{Lu}" are recognized. For a complete list of supported property patterns, see the User's Guide for UnicodeSet at http://oss.software.ibm.com/icu/userguide/unicodeset.html. Actual determination of property data is defined by the underlying Unicode database as implemented by UCharacter.

Patterns specify individual characters, ranges of characters, and Unicode property sets. When elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '['. Property patterns are inverted by modifying their delimiters: "[:^foo:]" and "\P{foo}". In any other location, '^' has no special meaning.

Ranges are indicated by placing a '-' between two characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left character is greater than or equal to the right character, it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or if it occurs as the last character before the closing ']', then it is taken as a literal. Thus "[a\u005C-b]", "[-ab]", and "[ab-]" all indicate the same set of three characters, 'a', 'b', and '-'.
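The range and literal '-' rules can be illustrated with a small parser for a restricted subset of the syntax. The sketch below is not ICU code: Tok and parseFlatPattern are hypothetical names, it handles ASCII only, and it deliberately leaves out '^', nested sets, the '&' operator, property patterns, and \u escapes. It resolves backslash escapes, drops unescaped white space, treats a '-' between two characters as a range, and treats a leading or trailing '-' as a literal.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One pattern character after escape processing.
struct Tok { char ch; bool escaped; };

// Hypothetical helper: parses a flat pattern such as "[a-z_]" into `out`.
// Returns false on a syntax error.
inline bool parseFlatPattern(const std::string& p, std::set<char>& out) {
    out.clear();
    if (p.size() < 2 || p[0] != '[' || p[p.size() - 1] != ']') return false;

    // Pass 1: resolve backslash escapes and drop unescaped white space.
    std::vector<Tok> toks;
    for (std::size_t i = 1; i + 1 < p.size(); ++i) {
        const char c = p[i];
        if (c == '\\') {
            if (i + 2 >= p.size()) return false;   // the backslash would swallow the closing ']'
            ++i;
            toks.push_back(Tok{p[i], true});
        } else if (c != ' ' && c != '\t' && c != '\n') {
            toks.push_back(Tok{c, false});
        }
    }

    // Pass 2: an unescaped '-' strictly between two characters forms a range;
    // a '-' in first or last position is an ordinary member.
    bool prevIsChar = false;   // true when toks[k - 1] may act as a left endpoint
    for (std::size_t k = 0; k < toks.size(); ++k) {
        const bool isDash = toks[k].ch == '-' && !toks[k].escaped;
        const bool inMiddle = k > 0 && k + 1 < toks.size();
        if (isDash && inMiddle) {
            if (!prevIsChar) return false;   // e.g. "[a-c-e]" has no left endpoint
            const unsigned char lo = static_cast<unsigned char>(toks[k - 1].ch);
            const unsigned char hi = static_cast<unsigned char>(toks[k + 1].ch);
            if (lo >= hi) return false;      // left >= right is a syntax error
            for (unsigned c = lo; c <= hi; ++c) out.insert(static_cast<char>(c));
            ++k;                             // consume the right endpoint
            prevIsChar = false;
        } else {
            out.insert(toks[k].ch);
            prevIsChar = true;
        }
    }
    return true;
}
```

Under these assumptions the three spellings of the example (an escaped '-', a leading '-', and a trailing '-') all produce the same three-member set, while a reversed range such as "[z-a]" is rejected.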

Sets may be intersected using the '&' operator, or the asymmetric set difference may be taken using the '-' operator. For example, "[[:L:]&[\u005Cu0000-\u005Cu0FFF]]" indicates the set of all Unicode letters with values less than 4096. Operators ('&' and '-') have equal precedence and bind left-to-right. Thus "[[:L:]-[a-z]-[\u005Cu0100-\u005Cu01FF]]" is equivalent to "[[[:L:]-[a-z]]-[\u005Cu0100-\u005Cu01FF]]". This only really matters for difference; intersection is commutative.
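A short sketch shows why the grouping matters for difference. It uses std::set<char> as a stand-in for a UnicodeSet; CharSet and difference are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <set>

using CharSet = std::set<char>;

// Asymmetric difference a - b: the elements of a that are not in b.
inline CharSet difference(const CharSet& a, const CharSet& b) {
    CharSet r;
    for (char c : a) {
        if (b.count(c) == 0) r.insert(c);
    }
    return r;
}

// Left-to-right grouping, as the pattern syntax specifies: (a - b) - c.
inline CharSet leftToRight(const CharSet& a, const CharSet& b, const CharSet& c) {
    return difference(difference(a, b), c);
}

// The other grouping, shown only for contrast: a - (b - c).
inline CharSet rightToLeft(const CharSet& a, const CharSet& b, const CharSet& c) {
    return difference(a, difference(b, c));
}
```

With A = {a, b, c}, B = {b, c}, and C = {c}, the left-to-right grouping gives {a}, while the other grouping gives {a, c}, because 'c' is first removed from B and therefore survives in A.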

[a]                 The set containing 'a'
[a-z]               The set containing 'a' through 'z' and all letters in between, in Unicode order
[^a-z]              The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF
[[pat1][pat2]]      The union of sets specified by pat1 and pat2
[[pat1]&[pat2]]     The intersection of sets specified by pat1 and pat2
[[pat1]-[pat2]]     The asymmetric difference of sets specified by pat1 and pat2
[:Lu:] or \p{Lu}    The set of characters having the specified Unicode property; in this case, Unicode uppercase letters
[:^Lu:] or \P{Lu}   The set of characters not having the given Unicode property
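The [^a-z] entry can be checked with a small model that stores a set as a sorted list of inclusive ranges. The representation and the names Range, RangeList, kMin, kMax, and complementOf are assumptions made for this sketch; the source does not describe the internal storage of UnicodeSet.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical representation: sorted, non-overlapping inclusive ranges.
using Range = std::pair<std::int32_t, std::int32_t>;
using RangeList = std::vector<Range>;

constexpr std::int32_t kMin = 0x0000;     // lowest legal code point
constexpr std::int32_t kMax = 0x10FFFF;   // highest legal code point

// Returns the ranges not covered by `in`, relative to kMin..kMax.
inline RangeList complementOf(const RangeList& in) {
    RangeList out;
    std::int32_t next = kMin;   // first code point not yet accounted for
    for (const Range& r : in) {
        if (r.first > next) out.push_back(Range(next, r.first - 1));
        next = r.second + 1;
    }
    if (next <= kMax) out.push_back(Range(next, kMax));
    return out;
}
```

Complementing the single range 'a'..'z' yields two ranges, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF, matching the table entry.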

Author:
Alan Liu @stable


Constructor & Destructor Documentation

UnicodeSet::UnicodeSet  
 

Constructs an empty set.

@stable

UnicodeSet::UnicodeSet UChar32    start,
UChar32    end
 

Constructs a set containing the given range.

If start > end then an empty set is created.

Parameters:
start  first character, inclusive, of range
end  last character, inclusive, of range

UnicodeSet::UnicodeSet const UnicodeString &    pattern,
UErrorCode &    status
 

Constructs a set from the given pattern.

See the class description for the syntax of the pattern language.

Parameters:
pattern  a string specifying what characters are in the set
Exceptions:
IllegalArgumentException  if the pattern contains a syntax error. @stable

UnicodeSet::UnicodeSet int8_t    category,
UErrorCode &    status
 

DEPRECATED Constructs a set from the given Unicode character category.

Parameters:
category  an integer indicating the character category as defined in uchar.h.
Deprecated:
To be removed after 2002-DEC-31

UnicodeSet::UnicodeSet const UnicodeSet &    o
 

Constructs a set that is identical to the given UnicodeSet.

@stable

virtual UnicodeSet::~UnicodeSet   [virtual]
 

Destructs the set.

@stable


Member Function Documentation

void UnicodeSet::add UChar32    c
 

Adds the specified character to this set if it is not already present.

If this set already contains the specified character, the call leaves this set unchanged. @draft ICU 2.0

virtual void UnicodeSet::add UChar32    start,
UChar32    end
[virtual]
 

Adds the specified range to this set if it is not already present.

If this set already contains the specified range, the call leaves this set unchanged. If start > end then an empty range is added, leaving the set unchanged. This is equivalent to a boolean logic OR, or a set UNION.

Parameters:
start  first character, inclusive, of range to be added to this set.
end  last character, inclusive, of range to be added to this set. @draft ICU 2.0

virtual void UnicodeSet::addAll const UnicodeSet &    c [virtual]
 

Adds all of the elements in the specified set to this set if they're not already present.

This operation effectively modifies this set so that its value is the union of the two sets. The behavior of this operation is unspecified if the specified collection is modified while the operation is in progress.

Parameters:
c  set whose elements are to be added to this set.
See also:
add(UChar32, UChar32) @stable

virtual void UnicodeSet::applyPattern const UnicodeString &    pattern,
UErrorCode &    status
[virtual]
 

Modifies this set to represent the set specified by the given pattern, optionally ignoring white space.

See the class description for the syntax of the pattern language.

Parameters:
pattern  a string specifying what characters are in the set
Exceptions:
IllegalArgumentException  if the pattern contains a syntax error. @stable

UChar32 UnicodeSet::charAt int32_t    index const
 

Returns the character at the given index within this set, where the set is ordered by ascending code point.

If the index is out of range, return (UChar32)-1. The inverse of this method is indexOf().

Parameters:
index  an index from 0..size()-1
Returns:
the character at the given index, or (UChar32)-1.
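The relationship between charAt() and indexOf() can be sketched over a list of inclusive ranges. This is a toy model under the assumption that the ranges are sorted and do not overlap; Range, RangeList, toyIndexOf, and toyCharAt are hypothetical names, and the real methods may be implemented quite differently.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Range = std::pair<std::int32_t, std::int32_t>;   // inclusive [first, second]
using RangeList = std::vector<Range>;                  // assumed sorted and non-overlapping

// Index of c among all members in ascending code point order, or -1 if absent.
inline std::int32_t toyIndexOf(const RangeList& ranges, std::int32_t c) {
    std::int32_t before = 0;   // members of the ranges that lie entirely below c
    for (const Range& r : ranges) {
        if (c < r.first) return -1;
        if (c <= r.second) return before + (c - r.first);
        before += r.second - r.first + 1;
    }
    return -1;
}

// Member at the given index in ascending code point order, or -1 if out of range.
inline std::int32_t toyCharAt(const RangeList& ranges, std::int32_t index) {
    if (index < 0) return -1;
    for (const Range& r : ranges) {
        const std::int32_t len = r.second - r.first + 1;
        if (index < len) return r.first + index;
        index -= len;
    }
    return -1;
}
```

For the ranges '0'..'9' and 'a'..'f', 'a' has index 10, and applying toyCharAt to the result of toyIndexOf returns the original character for every member, which is the inverse relationship described above.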

virtual void UnicodeSet::clear void    [virtual]
 

Removes all of the elements from this set.

This set will be empty after this call returns. @stable

virtual UnicodeFunctor* UnicodeSet::clone   const [virtual]
 

Returns a copy of this object.

All UnicodeFunctor objects have to support cloning in order to allow classes using UnicodeFunctors, such as Transliterator, to implement cloning. @draft ICU 2.0

Implements UnicodeFunctor.

void UnicodeSet::complement UChar32    c
 

Complements the specified character in this set.

The character will be removed if it is in this set, or will be added if it is not in this set. @draft ICU 2.0

virtual void UnicodeSet::complement UChar32    start,
UChar32    end
[virtual]
 

Complements the specified range in this set.

Any character in the range will be removed if it is in this set, or will be added if it is not in this set. If start > end then an empty range is complemented, leaving the set unchanged. This is equivalent to a boolean logic XOR.

Parameters:
start  first character, inclusive, of range to be complemented in this set.
end  last character, inclusive, of range to be complemented in this set. @draft ICU 2.0

virtual void UnicodeSet::complement void    [virtual]
 

Inverts this set.

This operation modifies this set so that its value is its complement. This is equivalent to complement(MIN_VALUE, MAX_VALUE). @stable

virtual void UnicodeSet::complementAll const UnicodeSet &    c [virtual]
 

Complements in this set all elements contained in the specified set.

Any character in the other set will be removed if it is in this set, or will be added if it is not in this set.

Parameters:
c  set that defines which elements will be xor'ed with this set.

virtual UBool UnicodeSet::contains UChar32    c const [virtual]
 

Returns true if this set contains the specified char.

Returns:
true if this set contains the specified char. @draft ICU 2.0

Implements UnicodeFilter.

virtual UBool UnicodeSet::contains UChar32    start,
UChar32    end
const [virtual]
 

Returns true if this set contains the specified range of chars.

Returns:
true if this set contains the specified range of chars. @draft ICU 2.0

virtual UBool UnicodeSet::containsAll const UnicodeSet &    c const [virtual]
 

Returns true if the specified set is a subset of this set.

Parameters:
c  set to be checked for containment in this set.
Returns:
true if this set contains all of the elements of the specified set. @stable

virtual UClassID UnicodeSet::getDynamicClassID void    const [inline, virtual]
 

Implement UnicodeFunctor API.

Returns:
The class ID for this object. All objects of a given class have the same class ID. Objects of other classes have different class IDs.

Reimplemented from UnicodeFunctor.

virtual int32_t UnicodeSet::getRangeCount void    const [virtual]
 

Iteration method that returns the number of ranges contained in this set.

See also:
getRangeStart , getRangeEnd

virtual UChar32 UnicodeSet::getRangeEnd int32_t    index const [virtual]
 

Iteration method that returns the last character in the specified range of this set.

See also:
getRangeCount , getRangeStart

virtual UChar32 UnicodeSet::getRangeStart int32_t    index const [virtual]
 

Iteration method that returns the first character in the specified range of this set.

See also:
getRangeCount , getRangeEnd
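The three iteration methods are meant to be used together: getRangeCount() bounds the loop, and getRangeStart() and getRangeEnd() give the inclusive endpoints of each range. The sketch below is written as a template so that it does not depend on ICU headers; countViaRanges and the stand-in type ToyRanges are hypothetical, and only the three method names come from this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Walks every range of a set-like object through the three iteration
// methods and returns the total number of code points covered.
template <class SetLike>
std::int64_t countViaRanges(const SetLike& s) {
    std::int64_t total = 0;
    const std::int32_t n = s.getRangeCount();
    for (std::int32_t i = 0; i < n; ++i) {
        // Both endpoints are inclusive, hence the + 1.
        total += static_cast<std::int64_t>(s.getRangeEnd(i)) - s.getRangeStart(i) + 1;
    }
    return total;
}

// Hypothetical stand-in so the sketch is self-contained; it only mimics
// the method names and inclusive-range convention documented here.
struct ToyRanges {
    std::vector<std::pair<std::int32_t, std::int32_t> > ranges;
    std::int32_t getRangeCount() const { return static_cast<std::int32_t>(ranges.size()); }
    std::int32_t getRangeStart(std::int32_t i) const { return ranges[static_cast<std::size_t>(i)].first; }
    std::int32_t getRangeEnd(std::int32_t i) const { return ranges[static_cast<std::size_t>(i)].second; }
};
```

For a set holding 'A'..'Z' and 'a'..'z', the loop visits two ranges and reports 52 code points, which should agree with what size() returns for the same set.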

UClassID UnicodeSet::getStaticClassID void    [inline, static]
 

Return the class ID for this class.

This is useful only for comparing to a return value from getDynamicClassID(). For example:

     Base* polymorphic_pointer = createPolymorphicObject();
     if (polymorphic_pointer->getDynamicClassID() ==
         Derived::getStaticClassID()) ...
 
Returns:
The class ID for all objects of this class. @stable

Reimplemented from UnicodeFunctor.
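The class ID comparison can be tried out with a minimal, self-contained hierarchy. Everything in this sketch is hypothetical: ClassID stands in for UClassID, and using the address of a function-local static as the ID is just one common way to obtain a unique value per class. This page does not say how ICU produces its class IDs.

```cpp
#include <cassert>

typedef void* ClassID;   // hypothetical stand-in for UClassID

struct Base {
    virtual ~Base() {}
    virtual ClassID getDynamicClassID() const = 0;
};

struct Derived : Base {
    // The address of a per-class static serves as a unique ID (assumption).
    static ClassID getStaticClassID() { return &classIdTag(); }
    ClassID getDynamicClassID() const override { return getStaticClassID(); }
private:
    static char& classIdTag() { static char tag = 0; return tag; }
};

struct Other : Base {
    static ClassID getStaticClassID() { return &classIdTag(); }
    ClassID getDynamicClassID() const override { return getStaticClassID(); }
private:
    static char& classIdTag() { static char tag = 0; return tag; }
};

// Same test as in the example above: does p really point at a Derived?
inline bool isDerived(const Base* p) {
    return p->getDynamicClassID() == Derived::getStaticClassID();
}
```

Because every object of a class returns that class's static ID from getDynamicClassID(), the comparison succeeds for a Derived object and fails for an Other object, even when both are reached through a Base pointer.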

virtual int32_t UnicodeSet::hashCode void    const [virtual]
 

Returns the hash code value for this set.

Returns:
the hash code value for this set.
See also:
Object::hashCode() @stable

int32_t UnicodeSet::indexOf UChar32    c const
 

Returns the index of the given character within this set, where the set is ordered by ascending code point.

If the character is not in this set, return -1. The inverse of this method is charAt().

Returns:
an index from 0..size()-1, or -1

virtual UBool UnicodeSet::isEmpty void    const [virtual]
 

Returns true if this set contains no elements.

Returns:
true if this set contains no elements. @stable

UBool UnicodeSet::operator!= const UnicodeSet &    o const [inline]
 

Compares the specified object with this set for inequality.

Returns true if the specified set is not equal to this set. @stable

UnicodeSet& UnicodeSet::operator= const UnicodeSet &    o
 

Assigns this object to be a copy of another.

@stable

virtual UBool UnicodeSet::operator== const UnicodeSet &    o const [virtual]
 

Compares the specified object with this set for equality.

Returns true if the two sets have the same size, and every member of the specified set is contained in this set (or equivalently, every member of this set is contained in the specified set).

Parameters:
o  set to be compared for equality with this set.
Returns:
true if the specified set is equal to this set. @stable

void UnicodeSet::remove UChar32    c
 

Removes the specified character from this set if it is present.

The set will not contain the specified character once the call returns. @draft ICU 2.0

virtual void UnicodeSet::remove UChar32    start,
UChar32    end
[virtual]
 

Removes the specified range from this set if it is present.

The set will not contain the specified range once the call returns. If start > end then an empty range is removed, leaving the set unchanged.

Parameters:
start  first character, inclusive, of range to be removed from this set.
end  last character, inclusive, of range to be removed from this set. @draft ICU 2.0

virtual void UnicodeSet::removeAll const UnicodeSet &    c [virtual]
 

Removes from this set all of its elements that are contained in the specified set.

This operation effectively modifies this set so that its value is the asymmetric set difference of the two sets.

Parameters:
c  set that defines which elements will be removed from this set. @stable

void UnicodeSet::retain UChar32    c
 

Retains only the specified character in this set, if it is present.

@draft ICU 2.0

virtual void UnicodeSet::retain UChar32    start,
UChar32    end
[virtual]
 

Retain only the elements in this set that are contained in the specified range.

If start > end then an empty range is retained, leaving the set empty. This is equivalent to a boolean logic AND, or a set INTERSECTION.

Parameters:
start  first character, inclusive, of range to be retained in this set.
end  last character, inclusive, of range to be retained in this set. @draft ICU 2.0

virtual void UnicodeSet::retainAll const UnicodeSet &    c [virtual]
 

Retains only the elements in this set that are contained in the specified set.

In other words, removes from this set all of its elements that are not contained in the specified set. This operation effectively modifies this set so that its value is the intersection of the two sets.

Parameters:
c  set that defines which elements this set will retain. @stable

void UnicodeSet::set UChar32    start,
UChar32    end
 

Make this object represent the range start - end.

If start > end then this object is set to an empty range.

Parameters:
start  first character in the set, inclusive
end  last character in the set, inclusive

virtual int32_t UnicodeSet::size void    const [virtual]
 

Returns the number of elements in this set (its cardinality), n, where 0 <= n <= 0x110000.

Returns:
the number of elements in this set (its cardinality). @stable

virtual UnicodeString& UnicodeSet::toPattern UnicodeString &    result,
UBool    escapeUnprintable = FALSE
const [virtual]
 

Returns a string representation of this set.

If the result of calling this function is passed to a UnicodeSet constructor, it will produce another set that is equal to this one.

Parameters:
result  the string to receive the rules. Previous contents will be deleted.
escapeUnprintable  if TRUE then convert unprintable characters to their hex escape representations, \uxxxx or \Uxxxxxxxx. Unprintable characters are those other than U+000A, U+0020..U+007E. @draft ICU 2.0

Reimplemented from UnicodeFilter.
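The escapeUnprintable rule can be sketched for a single code point. The helper names isUnprintable and escapeIfUnprintable are hypothetical, the input is assumed to be a valid code point, and the use of upper-case hex digits is an assumption; the description above fixes only the \uxxxx and \Uxxxxxxxx shapes and the definition of unprintable.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Unprintable, per the parameter description: anything other than
// U+000A and U+0020..U+007E.
inline bool isUnprintable(std::int32_t c) {
    return !(c == 0x000A || (c >= 0x0020 && c <= 0x007E));
}

// Hypothetical helper: returns the character itself when printable,
// otherwise a \uxxxx (up to U+FFFF) or \Uxxxxxxxx escape.
inline std::string escapeIfUnprintable(std::int32_t c) {
    if (!isUnprintable(c)) return std::string(1, static_cast<char>(c));
    char buf[16];
    if (c <= 0xFFFF) {
        std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(c));
    } else {
        std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(c));
    }
    return std::string(buf);
}
```

Under these assumptions a tab (U+0009) is escaped with the four-digit form and a supplementary code point such as U+1F600 with the eight-digit form, while 'a' and the line feed U+000A pass through unchanged.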


The documentation for this class was generated from the following file:
Generated on Mon Mar 4 21:44:03 2002 for ICU 2.0 by doxygen 1.2.14 written by Dimitri van Heesch, © 1997-2002