Package org.apache.fop.util
Class CharUtilities
java.lang.Object
org.apache.fop.util.CharUtilities
This class provides utilities to distinguish various kinds of Unicode
whitespace and to get character widths in a given FontState.
-
Field Summary
Modifier and Type, Field, Description
static final char CARRIAGE_RETURN - carriage return
static final char CODE_EOT - Character code used to signal a character boundary in inline content, such as an inline with borders and padding or a nested block object.
static final int EOT - Character class: Boundary between text runs
static final char IDEOGRAPHIC_SPACE - Ideographic space
static final char LINE_SEPARATOR - line-separator
static final int LINEFEED - Character class: Line feed
static final char LINEFEED_CHAR - linefeed character
static final char LRE - left-to-right embedding
static final char LRM - left-to-right mark
static final char LRO - left-to-right override
static final char MISSING_IDEOGRAPH - missing ideograph
static final char NBSPACE - non-breaking space
static final char NEXT_LINE - next line control character
static final int NONWHITESPACE - Character class: non-whitespace
static final char NOT_A_CHARACTER - Unicode value indicating that the character is "not a character".
static final char NULL_CHAR - null char
static final char OBJECT_REPLACEMENT_CHARACTER - Object replacement character
static final char PARAGRAPH_SEPARATOR - paragraph-separator
static final char PDF - pop directional formatting
static final char RLE - right-to-left embedding
static final char RLM - right-to-left mark
static final char RLO - right-to-left override
static final char SOFT_HYPHEN - soft hyphen
static final char SPACE - normal space
static final char TAB - normal tab
static final int UCWHITESPACE - Character class: Unicode white space
static final char WORD_JOINER - word joiner
static final int XMLWHITESPACE - Character class: XML whitespace
static final char ZERO_WIDTH_JOINER - zero-width joiner
static final char ZERO_WIDTH_NOBREAK_SPACE - zero-width no-break space (= byte order mark)
static final char ZERO_WIDTH_SPACE - zero-width space
-
Constructor Summary
Modifier, Constructor, Description
protected CharUtilities() - Utility class: the protected constructor prevents instantiation except from subclasses.
-
Method Summary
Modifier and Type, Method, Description
static String charToNCRef(int c) - Convert a single Unicode scalar value to an XML numeric character reference.
static int classOf(int c) - Return the appropriate CharClass constant for the type of the passed character.
codepointsIter(CharSequence s) - Creates an iterator over the code points of a CharSequence.
codepointsIter(CharSequence s, int beginIndex, int endIndex) - Creates an iterator over the code points of a sub-CharSequence.
static boolean containsSurrogatePairAt(CharSequence chars, int index) - Tells whether there is a surrogate pair starting at the given index in the CharSequence.
static String format(int c) - Format a character for debugging output: prefixed with "0x", padded left with '0', and either 4 or 6 hex characters wide according to whether it is in the BMP or not.
static int incrementIfNonBMP(int codePoint) - Returns 1 if codePoint is not in the BMP.
static boolean isAdjustableSpace(int c) - Method to determine if the character is an adjustable space.
static boolean isAlphabetic(int c) - Indicates whether a character is classified as "Alphabetic" by the Unicode standard.
static boolean isAnySpace(int c) - Determines if the character represents any kind of space.
static boolean isBmpCodePoint(int codePoint) - Determine whether the specified character (Unicode code point) is in the Basic Multilingual Plane (BMP).
static boolean isBreakableSpace(int c) - Helper method to determine if the character is a space with normal behavior.
static boolean isExplicitBreak(int c) - Indicates whether the given character is an explicit break character.
static boolean isFixedWidthSpace(int c) - Method to determine if the character is a (breakable) fixed-width space.
static boolean isNonBreakableSpace(int c) - Method to determine if the character is a non-breaking space.
static boolean isSameSequence(CharSequence cs1, CharSequence cs2) - Determine if two character sequences contain the same characters.
static boolean isSurrogatePair(char ch) - Determine if the given character is part of a surrogate pair.
static boolean isZeroWidthSpace(int c) - Method to determine if the character is a zero-width space.
static String padLeft - Pad a string S on the left out to width W using padding character PAD.
static String toNCRefs - Convert a string to a sequence of ASCII or XML numeric character references.
-
Field Details
-
CODE_EOT
public static final char CODE_EOT
Character code used to signal a character boundary in inline content, such as an inline with borders and padding or a nested block object.
-
UCWHITESPACE
public static final int UCWHITESPACE
Character class: Unicode white space
-
LINEFEED
public static final int LINEFEED
Character class: Line feed
-
EOT
public static final int EOT
Character class: Boundary between text runs
-
NONWHITESPACE
public static final int NONWHITESPACE
Character class: non-whitespace
-
XMLWHITESPACE
public static final int XMLWHITESPACE
Character class: XML whitespace
-
NULL_CHAR
public static final char NULL_CHAR
null char
-
LINEFEED_CHAR
public static final char LINEFEED_CHAR
linefeed character
-
CARRIAGE_RETURN
public static final char CARRIAGE_RETURN
carriage return
-
TAB
public static final char TAB
normal tab
-
SPACE
public static final char SPACE
normal space
-
NBSPACE
public static final char NBSPACE
non-breaking space
-
NEXT_LINE
public static final char NEXT_LINE
next line control character
-
ZERO_WIDTH_SPACE
public static final char ZERO_WIDTH_SPACE
zero-width space
-
WORD_JOINER
public static final char WORD_JOINER
word joiner
-
ZERO_WIDTH_JOINER
public static final char ZERO_WIDTH_JOINER
zero-width joiner
-
LRM
public static final char LRM
left-to-right mark
-
RLM
public static final char RLM
right-to-left mark
-
LRE
public static final char LRE
left-to-right embedding
-
RLE
public static final char RLE
right-to-left embedding
-
PDF
public static final char PDF
pop directional formatting
-
LRO
public static final char LRO
left-to-right override
-
RLO
public static final char RLO
right-to-left override
-
ZERO_WIDTH_NOBREAK_SPACE
public static final char ZERO_WIDTH_NOBREAK_SPACE
zero-width no-break space (= byte order mark)
-
SOFT_HYPHEN
public static final char SOFT_HYPHEN
soft hyphen
-
LINE_SEPARATOR
public static final char LINE_SEPARATOR
line-separator
-
PARAGRAPH_SEPARATOR
public static final char PARAGRAPH_SEPARATOR
paragraph-separator
-
MISSING_IDEOGRAPH
public static final char MISSING_IDEOGRAPH
missing ideograph
-
IDEOGRAPHIC_SPACE
public static final char IDEOGRAPHIC_SPACE
Ideographic space
-
OBJECT_REPLACEMENT_CHARACTER
public static final char OBJECT_REPLACEMENT_CHARACTER
Object replacement character
-
NOT_A_CHARACTER
public static final char NOT_A_CHARACTER
Unicode value indicating that the character is "not a character".
-
-
Constructor Details
-
CharUtilities
protected CharUtilities()
Utility class: the protected constructor prevents instantiation except from subclasses.
-
-
Method Details
-
classOf
public static int classOf(int c)
Return the appropriate CharClass constant for the type of the passed character.
Parameters:
c - character to inspect
Returns:
the determined character class
-
isBreakableSpace
public static boolean isBreakableSpace(int c)
Helper method to determine if the character is a space with normal behavior. Normal behavior means that it's not non-breaking.
Parameters:
c - character to inspect
Returns:
True if the character is a normal space
-
isZeroWidthSpace
public static boolean isZeroWidthSpace(int c)
Method to determine if the character is a zero-width space.
Parameters:
c - the character to check
Returns:
true if the character is a zero-width space
-
isFixedWidthSpace
public static boolean isFixedWidthSpace(int c)
Method to determine if the character is a (breakable) fixed-width space.
Parameters:
c - the character to check
Returns:
true if the character has a fixed width
-
isNonBreakableSpace
public static boolean isNonBreakableSpace(int c)
Method to determine if the character is a non-breaking space.
Parameters:
c - character to check
Returns:
True if the character is a non-breaking space
-
isAdjustableSpace
public static boolean isAdjustableSpace(int c)
Method to determine if the character is an adjustable space.
Parameters:
c - character to check
Returns:
True if the character is adjustable
-
isAnySpace
public static boolean isAnySpace(int c)
Determines if the character represents any kind of space.
Parameters:
c - character to check
Returns:
True if the character represents any kind of space
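The space predicates above can be read as set memberships. The following is a minimal sketch only: the class name SpaceSketch is hypothetical, the constants carry the standard Unicode values for the characters their names describe, and which characters fall into each set is an assumption made for illustration, not the membership CharUtilities actually uses. It shows one plausible relationship, with "any space" being the union of breakable and non-breaking spaces.

```java
final class SpaceSketch {
    // Standard Unicode code points for the characters named in the field list.
    static final char SPACE = 0x0020;
    static final char NBSPACE = 0x00A0;
    static final char ZERO_WIDTH_SPACE = 0x200B;
    static final char WORD_JOINER = 0x2060;
    static final char IDEOGRAPHIC_SPACE = 0x3000;
    static final char ZERO_WIDTH_NOBREAK_SPACE = 0xFEFF;

    // Assumed membership: spaces that forbid a line break.
    static boolean isNonBreakableSpace(int c) {
        return c == NBSPACE || c == WORD_JOINER || c == ZERO_WIDTH_NOBREAK_SPACE;
    }

    // Assumed membership: spaces with "normal" behavior, i.e. not non-breaking.
    static boolean isBreakableSpace(int c) {
        return c == SPACE || c == ZERO_WIDTH_SPACE || c == IDEOGRAPHIC_SPACE;
    }

    // Assumed composition: any space is either breakable or non-breaking.
    static boolean isAnySpace(int c) {
        return isBreakableSpace(c) || isNonBreakableSpace(c);
    }

    public static void main(String[] args) {
        System.out.println(isAnySpace(NBSPACE)); // prints true
        System.out.println(isAnySpace('x'));     // prints false
    }
}
```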
-
isAlphabetic
public static boolean isAlphabetic(int c)
Indicates whether a character is classified as "Alphabetic" by the Unicode standard.
Parameters:
c - the character
Returns:
true if the character is "Alphabetic"
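For comparison, the JDK offers its own check for the Unicode "Alphabetic" property. The sketch below uses Character.isAlphabetic(int) as a stand-in; the class name AlphabeticSketch is hypothetical, and whether CharUtilities delegates to the JDK or classifies characters itself is not stated here, so results may differ for some code points.

```java
final class AlphabeticSketch {
    // Stand-in for the method described above; assumption: delegate to the JDK.
    static boolean isAlphabetic(int c) {
        return Character.isAlphabetic(c);
    }

    public static void main(String[] args) {
        System.out.println(isAlphabetic('A'));    // Latin letter: prints true
        System.out.println(isAlphabetic(0x4E2D)); // CJK ideograph: prints true
        System.out.println(isAlphabetic('7'));    // digit: prints false
    }
}
```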
-
isExplicitBreak
public static boolean isExplicitBreak(int c)
Indicates whether the given character is an explicit break character.
Parameters:
c - the character to check
Returns:
true if the character represents an explicit break
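As an illustration of an explicit-break test, the sketch below treats the line-ending characters from the field list as breaks. The class name ExplicitBreakSketch is hypothetical, the constants use the standard Unicode values for those characters, and the choice of exactly these five characters is an assumption rather than a statement of what CharUtilities checks.

```java
final class ExplicitBreakSketch {
    // Standard Unicode values for the characters named in the field list.
    static final char LINEFEED_CHAR = '\n';
    static final char CARRIAGE_RETURN = '\r';
    static final char NEXT_LINE = 0x0085;
    static final char LINE_SEPARATOR = 0x2028;
    static final char PARAGRAPH_SEPARATOR = 0x2029;

    // Assumed set of explicit break characters.
    static boolean isExplicitBreak(int c) {
        return c == LINEFEED_CHAR
            || c == CARRIAGE_RETURN
            || c == NEXT_LINE
            || c == LINE_SEPARATOR
            || c == PARAGRAPH_SEPARATOR;
    }

    public static void main(String[] args) {
        System.out.println(isExplicitBreak('\n')); // prints true
        System.out.println(isExplicitBreak(' '));  // prints false
    }
}
```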
-
charToNCRef
Convert a single Unicode scalar value to an XML numeric character reference. If in the BMP, four digits are used, otherwise six digits are used.
Parameters:
c - a Unicode scalar value
Returns:
a string representing a numeric character reference
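The four-or-six digit rule can be sketched as follows. The class name NcRefSketch is hypothetical, and the output form is an assumption: the description does not say whether the reference is hexadecimal or which letter case is used, so this sketch picks the hexadecimal form with uppercase digits.

```java
final class NcRefSketch {
    // Assumption: hexadecimal reference "&#xHHHH;" with uppercase digits.
    static String charToNCRef(int c) {
        int digits = (c > 0xFFFF) ? 6 : 4; // 4 digits in the BMP, 6 beyond it
        return String.format("&#x%0" + digits + "X;", c);
    }

    public static void main(String[] args) {
        System.out.println(charToNCRef(0x00E9));  // prints &#x00E9;
        System.out.println(charToNCRef(0x1F600)); // prints &#x01F600;
    }
}
```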
-
toNCRefs
Convert a string to a sequence of ASCII or XML numeric character references.
Parameters:
s - a Java string (encoded in UTF-16)
Returns:
a string representing a sequence of numeric character references or ASCII characters
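A string-level conversion might look like the sketch below. The class name NcRefsSketch is hypothetical. Two assumptions are made: printable ASCII (0x20 to 0x7E) is copied through unchanged while everything else becomes a hexadecimal reference, and the string is walked by code point so that a surrogate pair yields a single reference. The real method may draw these lines differently.

```java
final class NcRefsSketch {
    static String toNCRefs(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp >= 0x20 && cp < 0x7F) {
                sb.append((char) cp); // assumed pass-through range: printable ASCII
            } else {
                int digits = (cp > 0xFFFF) ? 6 : 4;
                sb.append(String.format("&#x%0" + digits + "X;", cp));
            }
            i += Character.charCount(cp); // advance 2 chars for a surrogate pair
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNCRefs("caf\u00E9")); // prints caf&#x00E9;
    }
}
```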
-
padLeft
Pad a string S on the left out to width W using padding character PAD.
Parameters:
s - string to pad
width - width of the field to pad to
pad - character to use for padding
Returns:
padded string
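A minimal sketch of left padding follows. The class name PadSketch is hypothetical, and the parameter types (String, int, char) are assumptions inferred from the parameter descriptions. The sketch also assumes that a string already at least as long as the requested width is returned unchanged.

```java
final class PadSketch {
    // Parameter types are assumptions based on the parameter descriptions.
    static String padLeft(String s, int width, char pad) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append(pad); // one pad character per missing column
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        System.out.println(padLeft("7F", 4, '0'));    // prints 007F
        System.out.println(padLeft("12345", 4, '0')); // prints 12345
    }
}
```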
-
format
Format a character for debugging output: the result is prefixed with "0x", padded left with '0', and either 4 or 6 hex characters wide according to whether the character is in the BMP or not.
Parameters:
c - character code
Returns:
formatted character string
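The debugging format described above can be approximated with String.format. The class name FormatSketch is hypothetical, and the letter case of the hex digits is an assumption (uppercase here), since the description does not specify it.

```java
final class FormatSketch {
    // Assumption: uppercase hex digits; width is 4 in the BMP, 6 otherwise.
    static String format(int c) {
        return (c > 0xFFFF)
            ? String.format("0x%06X", c)
            : String.format("0x%04X", c);
    }

    public static void main(String[] args) {
        System.out.println(format(0x41));    // prints 0x0041
        System.out.println(format(0x10400)); // prints 0x010400
    }
}
```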
-
isSameSequence
Determine if two character sequences contain the same characters.
Parameters:
cs1 - first character sequence
cs2 - second character sequence
Returns:
true if both sequences have the same length and the same characters in the same order
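The comparison matters because CharSequence implementations such as String and StringBuilder do not compare equal through equals(). A minimal sketch, with the hypothetical class name SameSequenceSketch and without null handling, which the description does not cover:

```java
final class SameSequenceSketch {
    static boolean isSameSequence(CharSequence cs1, CharSequence cs2) {
        if (cs1.length() != cs2.length()) {
            return false; // different lengths can never match
        }
        for (int i = 0; i < cs1.length(); i++) {
            if (cs1.charAt(i) != cs2.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A String and a StringBuilder holding the same characters.
        System.out.println(isSameSequence("abc", new StringBuilder("abc"))); // prints true
        System.out.println(isSameSequence("abc", "abd"));                    // prints false
    }
}
```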
-
isBmpCodePoint
public static boolean isBmpCodePoint(int codePoint)
Determine whether the specified character (Unicode code point) is in the Basic Multilingual Plane (BMP). Such code points can be represented using a single char.
Parameters:
codePoint - the character (Unicode code point) to be tested
Returns:
true if the specified code point is between Character.MIN_VALUE and Character.MAX_VALUE inclusive; false otherwise
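The range test reduces to checking that no bits above the low 16 are set. The sketch below, with the hypothetical class name BmpSketch, shows that check; it is one way to express the documented range, not necessarily how CharUtilities writes it.

```java
final class BmpSketch {
    // True when the code point fits in 16 bits, i.e. 0x0000 through 0xFFFF.
    static boolean isBmpCodePoint(int codePoint) {
        return codePoint >>> 16 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isBmpCodePoint(0xFFFF));  // prints true
        System.out.println(isBmpCodePoint(0x10000)); // prints false
    }
}
```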
-
incrementIfNonBMP
public static int incrementIfNonBMP(int codePoint)
Returns 1 if codePoint is not in the BMP. This function is particularly useful in for loops over strings where, in the presence of surrogate pairs, you need to skip one iteration.
Parameters:
codePoint - the code point to test
Returns:
1 if codePoint > 0xFFFF, 0 otherwise
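The loop idiom described above can be sketched as follows. The class name IncrementSketch and the helper countCodePoints are hypothetical and exist only to show the extra increment skipping the low surrogate of a pair.

```java
final class IncrementSketch {
    static int incrementIfNonBMP(int codePoint) {
        return codePoint > 0xFFFF ? 1 : 0;
    }

    // Hypothetical helper: visits each code point once in a plain for loop.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            int cp = s.codePointAt(i);
            i += incrementIfNonBMP(cp); // step over the low surrogate, if any
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (a surrogate pair), 'b': four chars, three code points.
        System.out.println(countCodePoints("a\uD83D\uDE00b")); // prints 3
    }
}
```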
-
isSurrogatePair
public static boolean isSurrogatePair(char ch)
Determine if the given character is part of a surrogate pair.
Parameters:
ch - character to be checked
Returns:
true if ch is a high surrogate or a low surrogate
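The documented return value maps directly onto two JDK checks. A minimal sketch, with the hypothetical class name SurrogateSketch:

```java
final class SurrogateSketch {
    // True for either half of a surrogate pair, as the Returns text describes.
    static boolean isSurrogatePair(char ch) {
        return Character.isHighSurrogate(ch) || Character.isLowSurrogate(ch);
    }

    public static void main(String[] args) {
        System.out.println(isSurrogatePair((char) 0xD83D)); // high surrogate: prints true
        System.out.println(isSurrogatePair((char) 0xDE00)); // low surrogate: prints true
        System.out.println(isSurrogatePair('a'));           // prints false
    }
}
```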
-
containsSurrogatePairAt
Tells whether there is a surrogate pair starting at the given index in the CharSequence. If the character at index is a high surrogate, then the character at index+1 is checked to be a low surrogate. If a malformed surrogate pair is encountered, an IllegalArgumentException is thrown.
high surrogate [0xD800 - 0xDBFF]
low surrogate [0xDC00 - 0xDFFF]
Parameters:
chars - CharSequence to check
index - index in the CharSequence at which to start the check
Returns:
true if there is a well-formed surrogate pair at index
Throws:
IllegalArgumentException - if the surrogate pair is malformed
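The check described above can be sketched as follows. The class name SurrogatePairAtSketch and the exception messages are hypothetical. Treating a low surrogate found at index, with no preceding high surrogate, as malformed is an assumption; the description only covers the case of a high surrogate that lacks its partner.

```java
final class SurrogatePairAtSketch {
    static boolean containsSurrogatePairAt(CharSequence chars, int index) {
        char ch = chars.charAt(index);
        if (Character.isHighSurrogate(ch)) {
            boolean hasNext = index + 1 < chars.length();
            if (hasNext && Character.isLowSurrogate(chars.charAt(index + 1))) {
                return true; // well-formed pair
            }
            throw new IllegalArgumentException(
                "isolated high surrogate at index " + index);
        }
        if (Character.isLowSurrogate(ch)) {
            // Assumption: a low surrogate at the start position is also malformed.
            throw new IllegalArgumentException(
                "isolated low surrogate at index " + index);
        }
        return false; // ordinary BMP character
    }

    public static void main(String[] args) {
        System.out.println(containsSurrogatePairAt("\uD83D\uDE00", 0)); // prints true
        System.out.println(containsSurrogatePairAt("ab", 0));           // prints false
    }
}
```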
-
codepointsIter
Creates an iterator over the code points of a CharSequence.
Parameters:
s - CharSequence to iterate over
Returns:
code point iterator for the given CharSequence
-
codepointsIter
Creates an iterator over the code points of a sub-CharSequence.
Parameters:
s - CharSequence to iterate over
beginIndex - lower bound of the range
endIndex - upper bound of the range
Returns:
code point iterator for the given sub-CharSequence
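A code point iterator over a range can be sketched as below. The class name CodepointsIterSketch is hypothetical, and the return type Iterable<Integer> is an assumption, since the method's return type is not shown here. The sketch also assumes that endIndex is exclusive and does not fall in the middle of a surrogate pair.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class CodepointsIterSketch {
    // Assumption: the iterator is exposed as an Iterable<Integer>.
    static Iterable<Integer> codepointsIter(CharSequence s) {
        return codepointsIter(s, 0, s.length());
    }

    static Iterable<Integer> codepointsIter(final CharSequence s,
                                            final int beginIndex,
                                            final int endIndex) {
        if (beginIndex < 0 || endIndex > s.length() || beginIndex > endIndex) {
            throw new IllegalArgumentException("invalid range");
        }
        return () -> new Iterator<Integer>() {
            private int pos = beginIndex;

            @Override
            public boolean hasNext() {
                return pos < endIndex;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int cp = Character.codePointAt(s, pos);
                pos += Character.charCount(cp); // 2 chars for a surrogate pair
                return cp;
            }
        };
    }

    public static void main(String[] args) {
        for (int cp : codepointsIter("a\uD83D\uDE00b")) {
            System.out.println(Integer.toHexString(cp)); // prints 61, 1f600, 62 on separate lines
        }
    }
}
```

With an Iterable, callers can use the enhanced for loop directly and never handle surrogate halves themselves.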
-