Code Examples
package main

import (
	"fmt"

	"golang.org/x/text/cases"
	"golang.org/x/text/language"
)

func main() {
	src := []string{
		"hello world!",
		"i with dot",
		"'n ijsberg",
		"here comes O'Brian",
	}
	for _, c := range []cases.Caser{
		cases.Lower(language.Und),
		cases.Upper(language.Turkish),
		cases.Title(language.Dutch),
		cases.Title(language.Und, cases.NoLower),
	} {
		fmt.Println()
		for _, s := range src {
			fmt.Println(c.String(s))
		}
	}
}
Package-Level Type Names (total 17, in which 2 are exported)
A Caser transforms given input to a certain case. It implements
transform.Transformer.
A Caser may be stateful and should therefore not be shared between
goroutines.
t transform.SpanningTransformer
Bytes returns a new byte slice with the result of converting b to the case
form implemented by c.
Reset resets the Caser to be reused for new input after a previous call to
Transform.
Span implements the transform.SpanningTransformer interface.
String returns a string with the result of transforming s to the case form
implemented by c.
Transform implements the transform.Transformer interface and transforms the
given input to the case form implemented by c.
T : golang.org/x/text/transform.SpanningTransformer
T : golang.org/x/text/transform.Transformer
T : vendor/golang.org/x/text/transform.SpanningTransformer
T : vendor/golang.org/x/text/transform.Transformer
func Fold(opts ...Option) Caser
func Lower(t language.Tag, opts ...Option) Caser
func Title(t language.Tag, opts ...Option) Caser
func Upper(t language.Tag, opts ...Option) Caser
caseTrie. Total size: 12538 bytes (12.24 KiB). Checksum: af4dfa7d60c71d4c.
lookup returns the trie value for the first UTF-8 encoding in s and
the width in bytes of this encoding. The size will be 0 if s does not
hold enough bytes to complete the encoding. len(s) must be greater than 0.
lookupString returns the trie value for the first UTF-8 encoding in s and
the width in bytes of this encoding. The size will be 0 if s does not
hold enough bytes to complete the encoding. len(s) must be greater than 0.
lookupStringUnsafe returns the trie value for the first UTF-8 encoding in s.
s must start with a full and valid UTF-8 encoded rune.
lookupUnsafe returns the trie value for the first UTF-8 encoding in s.
s must start with a full and valid UTF-8 encoded rune.
lookupValue determines the type of block n and looks up the value for b.
func newCaseTrie(i int) *caseTrie
var trie *caseTrie
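The contract described for lookup (a value plus the width in bytes, with a width of 0 when s does not hold a complete encoding) can be sketched with the standard unicode/utf8 package. The map-backed table and the name lookupSketch are stand-ins invented for this example; the real caseTrie works on generated, compressed blocks.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// table is a hypothetical stand-in for the generated trie data.
var table = map[rune]uint16{'a': 1, '\u00e9': 2}

// lookupSketch returns the value for the first UTF-8 encoding in s and its
// width in bytes. The width is 0 if s ends in the middle of an encoding.
// Like the documented lookup, it expects len(s) > 0.
func lookupSketch(s []byte) (v uint16, sz int) {
	if !utf8.FullRune(s) {
		return 0, 0 // not enough bytes to complete the encoding
	}
	r, size := utf8.DecodeRune(s)
	return table[r], size
}

func main() {
	fmt.Println(lookupSketch([]byte("abc")))    // complete 1-byte encoding
	fmt.Println(lookupSketch([]byte("\u00e9"))) // complete 2-byte encoding
	fmt.Println(lookupSketch([]byte{0xC3}))     // truncated 2-byte encoding
}
```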
A context is used for iterating over source bytes, fetching case info and
writing to a destination buffer.
Casing operations may need more than one rune of context to decide how a rune
should be cased. Casing implementations should call checkpoint on context
whenever it is known to be safe to return the runes processed so far.
It is recommended for implementations to not allow for more than 30 case
ignorables as lookahead (analogous to the limit in norm) and to use state if
unbounded lookahead is needed for cased runes.
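To make the checkpoint idea concrete, here is a minimal sketch of a Transform-style function that only ever reports positions at which it is safe to stop. It is not the package's context type: upperChunk, errShortSrc and errShortDst are names invented for this example (the error values stand in for those of golang.org/x/text/transform), and because simple upper-casing needs no lookahead, every completed rune is a checkpoint.

```go
package main

import (
	"errors"
	"fmt"
	"unicode"
	"unicode/utf8"
)

// Hypothetical stand-ins for transform.ErrShortSrc and transform.ErrShortDst.
var (
	errShortSrc = errors.New("short source buffer")
	errShortDst = errors.New("short destination buffer")
)

// upperChunk upper-cases src into dst rune by rune. The returned nDst and
// nSrc always point just past the last fully processed rune, so the caller
// can resume from there once more input or more space is available.
func upperChunk(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		if !atEOF && !utf8.FullRune(src[nSrc:]) {
			// The trailing bytes may be the start of a longer encoding.
			return nDst, nSrc, errShortSrc
		}
		r, size := utf8.DecodeRune(src[nSrc:])
		u := unicode.ToUpper(r)
		if nDst+utf8.RuneLen(u) > len(dst) {
			return nDst, nSrc, errShortDst
		}
		nDst += utf8.EncodeRune(dst[nDst:], u)
		nSrc += size // checkpoint: everything up to here is final
	}
	return nDst, nSrc, nil
}

func main() {
	dst := make([]byte, 16)
	// The input ends with the first byte of a two-byte encoding.
	nDst, nSrc, err := upperChunk(dst, []byte("h\xc3"), false)
	fmt.Printf("%q consumed=%d err=%v\n", dst[:nDst], nSrc, err)
}
```

Casing rules that depend on following runes cannot checkpoint after every rune, which is why the documented context distinguishes the scan positions (pDst, pSrc) from the safe return positions (nDst, nSrc).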
atEOF bool
dst []byte
err error
// case information of currently scanned rune
State preserved across calls to Transform.
// false if next cased letter needs to be title-cased.
checkpoints safe to return in Transform, where nDst <= pDst and nSrc <= pSrc.
// pDst points past the last written rune in dst.
// pSrc points to the start of the currently scanned rune.
src []byte
// size of current rune
(*T) Reset()
caseType returns an info with only the case bits, normalized to either
cLower, cUpper, cTitle or cUncased.
checkpoint sets the return value buffer points for Transform to the current
positions.
copy writes the current rune to dst.
copyXOR copies the current rune to dst and modifies it by applying the XOR
pattern of the case info. It is the responsibility of the caller to ensure
that this is a rune with a XOR pattern defined.
hasPrefix returns true if src[pSrc:] starts with the given string.
(*T) next() bool
ret returns the return values for the Transform method. It checks whether
there were insufficient bytes in src to complete and introduces an error
accordingly, if necessary.
retSpan returns the return values for the Span method. It checks whether
there were insufficient bytes in src to complete and introduces an error
accordingly, if necessary.
unreadRune causes the last rune read by next to be reread on the next
invocation of next. Only one unreadRune may be called after a call to next.
writeBytes adds bytes to dst.
writeString writes the given string to dst.
func afnlRewrite(c *context)
func aztrLower(c *context) (done bool)
func elUpper(c *context) bool
func finalSigmaBody(c *context) bool
func foldFull(c *context) bool
func isFoldFull(c *context) bool
func isLower(c *context) bool
func isTitle(c *context) bool
func isUpper(c *context) bool
func lower(c *context) bool
func ltLower(c *context) bool
func nlTitle(c *context) bool
func nlTitleSpan(c *context) bool
func noSpan(c *context) bool
func title(c *context) bool
func upper(c *context) bool
info holds case information for a single rune. It is the value returned
by a trie lookup. Most mapping information can be stored in a single 16-bit
value. If not, for example when a rune is mapped to multiple runes, the value
stores some basic case data and an index into an array with additional data.
The per-rune values have the following format:
  if (exception) {
    15..4  unsigned exception index
  } else {
    15..8  XOR pattern or index to XOR pattern for case mapping
           Only 13..8 are used for XOR patterns.
        7  inverseFold (fold to upper, not to lower)
        6  index: interpret the XOR pattern as an index
           or isMid if case mode is cIgnorableUncased.
     5..4  CCC: zero (normal or break), above or other
  }
     3  exception: interpret this value as an exception index
        (TODO: is this bit necessary? Probably implied from case mode.)
  2..0  case mode
For the non-exceptional cases, a rune must be either uncased, lowercase or
uppercase. If the rune is cased, the XOR pattern maps either a lowercase
rune to uppercase or an uppercase rune to lowercase (applied to the 10
least-significant bits of the rune).
See the definitions below for a more detailed description of the various
bits.
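The bit layout above can be read back with a few shifts and masks. The following is a sketch only: the type is redeclared locally, and the constant and method names are invented for illustration. Only the bit positions are taken from the format description.

```go
package main

import "fmt"

// info mirrors the documented 16-bit per-rune value.
type info uint16

// Hypothetical names; the bit positions follow the documented format.
const (
	caseModeMask   info = 0x0007 // bits 2..0: case mode
	exceptionBit   info = 0x0008 // bit 3: value is an exception index
	cccShift            = 4      // bits 5..4: CCC (non-exception values)
	cccMask2       info = 0x3
	xorShift            = 8 // bits 15..8: XOR pattern or index
	exceptionShift      = 4 // bits 15..4: exception index
)

func (v info) caseMode() info      { return v & caseModeMask }
func (v info) isException() bool   { return v&exceptionBit != 0 }
func (v info) ccc() info           { return (v >> cccShift) & cccMask2 }
func (v info) xorPattern() info    { return v >> xorShift }
func (v info) exceptionIndex() int { return int(v >> exceptionShift) }

func main() {
	// A regular value: XOR pattern 0x20, CCC bits 0, case mode 2.
	regular := info(0x20)<<xorShift | 2
	fmt.Println(regular.isException(), regular.xorPattern(), regular.caseMode())

	// An exception value: index 5, exception bit set, case mode 1.
	exc := info(5)<<exceptionShift | exceptionBit | 1
	fmt.Println(exc.isException(), exc.exceptionIndex(), exc.caseMode())
}
```

The numeric case mode values used here are arbitrary; the sketch makes no claim about which number corresponds to cLower, cUpper and so on.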
( T) cccType() info
( T) cccVal() info
isBreak returns whether this rune should introduce a break.
( T) isCaseIgnorable() bool
( T) isCaseIgnorableAndNotCased() bool
( T) isCased() bool
isLetter returns whether the rune is of break type ALetter, Hebrew_Letter,
Numeric, ExtendNumLet, or Extend.
( T) isMid() bool
( T) isNotCasedAndNotCaseIgnorable() bool
const cccAbove
const cccBreak
const cccMask
const cccOther
const cccZero
const cIgnorableCased
const cIgnorableUncased
const cLower
const cTitle
const cUncased
const cUpper
const cXORCase
const maxCaseMode
lowerCaser implements the Transformer interface. The default Unicode lower
casing requires different treatment for the first and subsequent characters
of a word, most notably to handle the Greek final Sigma.
undLowerIgnoreSigmaCaser.NopResetter transform.NopResetter
context context
context.atEOF bool
context.dst []byte
context.err error
// case information of currently scanned rune
State preserved across calls to Transform.
// false if next cased letter needs to be title-cased.
checkpoints safe to return in Transform, where nDst <= pDst and nSrc <= pSrc.
// pDst points past the last written rune in dst.
// pSrc points to the start of the currently scanned rune.
context.src []byte
// size of current rune
first mapFunc
midWord mapFunc
undLowerIgnoreSigmaCaser undLowerIgnoreSigmaCaser
(*T) Reset()
Span implements a generic lower-casing. This is possible as isLower works
for all lowercasing variants. All lowercase variants only vary in how they
transform a non-lowercase letter. They will never change an already lowercase
letter. In addition, there is no state.
(*T) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
*T : golang.org/x/text/transform.SpanningTransformer
*T : golang.org/x/text/transform.Transformer
*T : vendor/golang.org/x/text/transform.SpanningTransformer
*T : vendor/golang.org/x/text/transform.Transformer
A mapFunc takes a context set to the current rune and writes the mapped
version to the same context. It may advance the context to the next rune. It
returns whether a checkpoint is possible: whether the pDst bytes written to
dst so far won't need changing as we see more source bytes.
func aztrUpper(f mapFunc) mapFunc
func finalSigma(f mapFunc) mapFunc
func ltUpper(f mapFunc) mapFunc
context context
context.atEOF bool
context.dst []byte
context.err error
// case information of currently scanned rune
State preserved across calls to Transform.
// false if next cased letter needs to be title-cased.
checkpoints safe to return in Transform, where nDst <= pDst and nSrc <= pSrc.
// pDst points past the last written rune in dst.
// pSrc points to the start of the currently scanned rune.
context.src []byte
// size of current rune
f mapFunc
span spanFunc
(*T) Reset()
(*T) Span(src []byte, atEOF bool) (n int, err error)
simpleCaser implements the Transformer interface for doing a case operation
on a rune-by-rune basis.
*T : golang.org/x/text/transform.SpanningTransformer
*T : golang.org/x/text/transform.Transformer
*T : vendor/golang.org/x/text/transform.SpanningTransformer
*T : vendor/golang.org/x/text/transform.Transformer
A spanFunc takes a context set to the current rune and returns whether this
rune would be altered when written to the output. It may advance the context
to the next rune. It returns whether a checkpoint is possible.
titleCaser implements the Transformer interface. Title casing algorithms
distinguish between the first letter of a word and subsequent letters of the
same word. It uses state to avoid requiring a potentially infinite lookahead.
context context
context.atEOF bool
context.dst []byte
context.err error
// case information of currently scanned rune
State preserved across calls to Transform.
// false if next cased letter needs to be title-cased.
checkpoints safe to return in Transform, where nDst <= pDst and nSrc <= pSrc.
// pDst points past the last written rune in dst.
// pSrc points to the start of the currently scanned rune.
context.src []byte
// size of current rune
lower mapFunc
rewrite func(*context)
rune mappings used by the actual casing algorithms.
titleSpan spanFunc
(*T) Reset()
(*T) Span(src []byte, atEOF bool) (n int, err error)
Transform implements the standard Unicode title case algorithm as defined in
Chapter 3 of The Unicode Standard:
toTitlecase(X): Find the word boundaries in X according to Unicode Standard
Annex #29, "Unicode Text Segmentation." For each word boundary, find the
first cased character F following the word boundary. If F exists, map F to
Titlecase_Mapping(F); then map all characters C between F and the following
word boundary to Lowercase_Mapping(C).
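The shape of toTitlecase(X) can be sketched with the standard library alone. This is a deliberately simplified illustration and not the package's implementation: it treats whitespace as the only word boundary instead of following Annex #29, approximates the cased property with unicode.IsUpper, IsLower and IsTitle, and uses single-rune mappings. The names isCased and toTitleSketch are invented for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isCased is a rough stand-in for the Unicode "cased" property.
func isCased(r rune) bool {
	return unicode.IsUpper(r) || unicode.IsLower(r) || unicode.IsTitle(r)
}

// toTitleSketch title-cases the first cased rune after each (whitespace)
// word boundary and lower-cases the runes that follow it in the same word.
// A single boolean carries the "inside a word" state, so no lookahead is
// needed.
func toTitleSketch(s string) string {
	var b strings.Builder
	seenCased := false // true once the first cased rune of the word was mapped
	for _, r := range s {
		switch {
		case unicode.IsSpace(r):
			seenCased = false // word boundary
			b.WriteRune(r)
		case !seenCased && isCased(r):
			seenCased = true
			b.WriteRune(unicode.ToTitle(r))
		case seenCased:
			b.WriteRune(unicode.ToLower(r))
		default:
			b.WriteRune(r) // uncased rune before the first cased rune
		}
	}
	return b.String()
}

func main() {
	fmt.Println(toTitleSketch("hello wORLD")) // Hello World
	fmt.Println(toTitleSketch("'n ijsberg"))  // 'N Ijsberg
}
```

The second call shows why language-specific casers exist: this generic sketch knows nothing about Dutch conventions for "'n" or the "ij" digraph.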
*T : golang.org/x/text/transform.SpanningTransformer
*T : golang.org/x/text/transform.Transformer
*T : vendor/golang.org/x/text/transform.SpanningTransformer
*T : vendor/golang.org/x/text/transform.Transformer
undLowerCaser implements the Transformer interface for doing a lower case
mapping for the root locale (und) ignoring final sigma handling. This casing
algorithm is used in some performance-critical packages like secure/precis
and x/net/http/idna, which warrants its special-casing.
NopResetter transform.NopResetter
Reset implements the Reset method of the Transformer interface.
( T) Span(src []byte, atEOF bool) (n int, err error)
( T) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
T : golang.org/x/text/transform.SpanningTransformer
T : golang.org/x/text/transform.Transformer
T : vendor/golang.org/x/text/transform.SpanningTransformer
T : vendor/golang.org/x/text/transform.Transformer
undLowerIgnoreSigmaCaser implements the Transformer interface for doing
a lower case mapping for the root locale (und) ignoring final sigma
handling. This casing algorithm is used in some performance-critical packages
like secure/precis and x/net/http/idna, which warrants its special-casing.
NopResetter transform.NopResetter
Reset implements the Reset method of the Transformer interface.
Span implements a generic lower-casing. This is possible as isLower works
for all lowercasing variants. All lowercase variants only vary in how they
transform a non-lowercase letter. They will never change an already lowercase
letter. In addition, there is no state.
( T) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
T : golang.org/x/text/transform.SpanningTransformer
T : golang.org/x/text/transform.Transformer
T : vendor/golang.org/x/text/transform.SpanningTransformer
T : vendor/golang.org/x/text/transform.Transformer
NopResetter transform.NopResetter
Reset implements the Reset method of the Transformer interface.
( T) Span(src []byte, atEOF bool) (n int, err error)
undUpperCaser implements the Transformer interface for doing an upper case
mapping for the root locale (und). It eliminates the need for an allocation
as it prevents escaping by not using function pointers.
T : golang.org/x/text/transform.SpanningTransformer
T : golang.org/x/text/transform.Transformer
T : vendor/golang.org/x/text/transform.SpanningTransformer
T : vendor/golang.org/x/text/transform.Transformer
Package-Level Functions (total 35, in which 5 are exported)
Fold returns a Caser that implements Unicode case folding. The returned Caser
is stateless and safe to use concurrently by multiple goroutines.
Case folding does not normalize the input and may not preserve a normal form.
Use the collate or search package for more convenient and linguistically
sound comparisons. Use golang.org/x/text/secure/precis for string comparisons
where security aspects are a concern.
HandleFinalSigma specifies whether the special handling of Greek final sigma
should be enabled. Unicode prescribes handling the Greek final sigma for all
locales, but standards like IDNA and PRECIS override this default.
Lower returns a Caser for language-specific lowercasing.
Title returns a Caser for language-specific title casing. It uses an
approximation of the default Unicode Word Break algorithm.
Upper returns a Caser for language-specific uppercasing.
Not part of CLDR, but see https://unicode.org/cldr/trac/ticket/7078.
elUpper implements Greek upper casing, which entails removing a predefined
set of non-blocked modifiers. Note that these accents should not be removed
for title casing!
Example: "Οδός" -> "ΟΔΟΣ".
finalSigma adds Greek final Sigma handling to another casing function. It
determines whether a lowercased sigma should be σ or ς by looking ahead for
case-ignorables and cased letters.
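The σ versus ς decision can be illustrated with a much simpler rule than the one the package uses. In this sketch, which is an assumption-laden approximation, a capital sigma becomes ς when the previous rune is a letter and the next rune is not. Case-ignorable runes, which the real lookahead skips over, are not handled. The name lowerWithFinalSigma is invented for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

const (
	capitalSigma = '\u03A3' // Σ
	smallSigma   = '\u03C3' // σ
	finalSigma   = '\u03C2' // ς
)

// lowerWithFinalSigma lower-cases s and writes ς for a capital sigma that
// ends a word, σ otherwise. Simplification: "ends a word" means the
// previous rune is a letter and the next rune is not a letter.
func lowerWithFinalSigma(s string) string {
	rs := []rune(s)
	var b strings.Builder
	for i, r := range rs {
		if r != capitalSigma {
			b.WriteRune(unicode.ToLower(r))
			continue
		}
		afterLetter := i > 0 && unicode.IsLetter(rs[i-1])
		beforeLetter := i+1 < len(rs) && unicode.IsLetter(rs[i+1])
		if afterLetter && !beforeLetter {
			b.WriteRune(finalSigma)
		} else {
			b.WriteRune(smallSigma)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(lowerWithFinalSigma("ΟΔΟΣ"))
	fmt.Println(lowerWithFinalSigma("ΣΟΦΟΣ"))
}
```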
The case mapping implementation will need to know about various Canonical
Combining Class (CCC) values. We encode two of these in the trie value:
cccZero (0) and cccAbove (230). If the value is cccOther, it means that
CCC(r) > 0, but not 230. A value of cccBreak means that CCC(r) == 0 and that
the rune also has the break category Break (see below).
The case mode bits encode the case type of a rune. This includes uncased,
title, upper and lower case and case ignorable. (For a definition of these
terms see Chapter 3 of The Unicode Standard Core Specification.) In some rare
cases, a rune can be both cased and case-ignorable. This is encoded by
cIgnorableCased. A rune of this type is always lower case. Some runes are
cased while not having a mapping.
A common pattern for scripts in the Unicode standard is for upper and lower
case runes to alternate for increasing rune values (e.g. the accented Latin
ranges starting from U+0100 and U+1E00 among others and some Cyrillic
characters). We use this property by defining a cXORCase mode, where the case
mode (always upper or lower case) is derived from the rune value. As the XOR
pattern for case mappings is often identical for successive runes, using
cXORCase can result in large series of identical trie values. This, in turn,
allows us to better compress the trie blocks.
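The alternating pattern that motivates cXORCase can be checked with the standard unicode package. The sketch below is illustrative only and does not reproduce the trie encoding: it verifies that in U+0100..U+012F every even rune is upper case and its lower-case form is the next rune, so a single XOR pattern of 1 maps each member of a pair to the other. The function names are invented for this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// toggleCase maps a rune to the other member of its case pair by XOR-ing
// it with the given pattern.
func toggleCase(r, pattern rune) rune { return r ^ pattern }

// alternates reports whether every even rune in [lo, hi] is upper case and
// maps to the following odd rune when lower-cased. lo is expected to be even.
func alternates(lo, hi rune) bool {
	for r := lo; r <= hi; r += 2 {
		if !unicode.IsUpper(r) || unicode.ToLower(r) != toggleCase(r, 1) {
			return false
		}
	}
	return true
}

func main() {
	// ASCII letters differ by a single bit, 0x20.
	fmt.Printf("%c %c\n", toggleCase('a', 0x20), toggleCase('A', 0x20))

	// Accented Latin range starting at U+0100: one pattern fits every pair.
	fmt.Println(alternates(0x0100, 0x012F))
}
```

Because the same pattern and the same derivation rule apply to the whole run, the per-rune values can be identical, which is what makes the blocks compress well.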
lastRuneForTesting is the last rune used for testing. Everything after this
is boring.
The exceptions slice holds data that does not fit in a normal info entry.
The entry is pointed to by the exception index in an entry. It has the
following format:
Header
  byte 0:
    7..6  unused
    5..4  CCC type (same bits as entry)
    3     unused
    2..0  length of fold

  byte 1:
    7..6  unused
    5..3  length of 1st mapping of case type
    2..0  length of 2nd mapping of case type

    case     1st    2nd
    lower -> upper, title
    upper -> lower, title
    title -> lower, upper

Lengths with the value 0x7 indicate no value and imply no change.
A length of 0 indicates a mapping to a zero-length string.

Body bytes:
  case folding bytes
  lowercase mapping bytes
  uppercase mapping bytes
  titlecase mapping bytes
  closure mapping bytes (for NFKC_Casefold). (TODO)

Fallbacks:
  missing fold  -> lower
  missing title -> upper
  all missing   -> original rune
exceptions starts with a dummy byte to enforce that there is no zero index
value.
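The two header bytes described above can be decoded with shifts and masks. This is a sketch under stated assumptions: the struct, its field names, parseHeader and noMapping are invented for illustration, and only the bit positions and the meaning of the length 0x7 come from the format description.

```go
package main

import "fmt"

// noMapping is the length value that means "no value, no change".
const noMapping = 0x7

// header holds the fields decoded from the two header bytes.
type header struct {
	ccc       byte // byte 0, bits 5..4
	foldLen   byte // byte 0, bits 2..0
	firstLen  byte // byte 1, bits 5..3: length of 1st mapping
	secondLen byte // byte 1, bits 2..0: length of 2nd mapping
}

// parseHeader extracts the documented fields; unused bits are ignored.
func parseHeader(b0, b1 byte) header {
	return header{
		ccc:       (b0 >> 4) & 0x3,
		foldLen:   b0 & 0x7,
		firstLen:  (b1 >> 3) & 0x7,
		secondLen: b1 & 0x7,
	}
}

func main() {
	// 0x12 = 0b0001_0010: CCC type 1, fold length 2.
	// 0x0F = 0b0000_1111: 1st mapping length 1, 2nd mapping absent (0x7).
	h := parseHeader(0x12, 0x0F)
	fmt.Printf("%+v\n", h)
	fmt.Println("second mapping present:", h.secondLen != noMapping)
}
```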
maxIgnorable defines the maximum number of ignorables to consider for
lookahead operations.
The pages are generated with Golds v0.3.2-preview. (GOOS=darwin GOARCH=amd64)
Golds is a Go 101 project developed by Tapir Liu.