Since games are typically localized into several languages, code that handles text data must be able to deal with potentially many languages.
This document provides general information as well as CRYENGINE-specific information, aimed at programmers who work with the source code.
For asset creation and design, please refer to the Localization section for information on how to perform localization and create localized assets.
Starting with CRYENGINE version 3.8.1 we have done significant work in this area. This document applies to this version and newer versions.
This page first introduces some terminology and acronyms that are needed to understand the rest of the document.
They are specific to the context of text processing. For more information on a specific term, consider searching Google or Wikipedia.
Term | Description |
---|---|
character | A unit of textual data, can be a glyph or formatting indicator. Note that a glyph does not necessarily form a single visible unit. For example, a diacritical mark ´ and a letter a are separate glyphs (and characters), but overlaid they form á |
Unicode | A standard maintained by the Unicode Consortium that deals with text and language standardization |
UCS | Universal Character Set, the standardized set of characters in the Unicode standard (also, ISO-10646) |
(UCS) code-point | An integral identifier for a single character in the UCS defined range, typically displayed with the U prefix followed by hexadecimal: U+12AB |
(text) encoding | (noun) A method of mapping (a subset of) UCS to a sequence of code-units. (verb) The process of applying an encoding (the noun). |
code-unit | An encoding-specific unit integral identifier used to encode code-points. Many code-units may be used to represent a single code-point. |
ASCII | A standardized encoding that covers the first 128 code-points of the UCS space using 7- or 8-bit code-units |
(ANSI) code-page | A standardized encoding that extends ASCII by assigning additional meaning to the higher 128 values when using 8-bit code-units. There are many hundreds of code-pages, some of which use multi-byte sequences to encode code-points |
UTF | UCS Transformation Format, a standardized encoding that covers the entire UCS space |
UTF-8 | A specific instance of UTF, using 8-bit code-units. Each code-point can take 1 to 4 (inclusive) code-units |
UTF-16 | A specific instance of UTF, using 16-bit code-units. Each code-point can take 1 or 2 code-units |
UTF-32 | A specific instance of UTF, using 32-bit code-units. Each code-point is directly mapped to a single code-unit |
byte-order | A way that a CPU treats a sequence of bytes when interpreting multi-byte values. Typically either little-endian or big-endian format |
encoding error | A sequence of code-units that does not form a code-point (or an invalid code-point, as defined by the Unicode standard) |
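To make the difference between code-points and code-units concrete, the following minimal sketch stores the same two code-points in each UTF and reports how many code-units each one takes. The struct and function names are hypothetical and only serve the illustration. The UTF-8 bytes are written as explicit escapes so the result does not depend on the encoding of the source file.

```cpp
#include <cstddef>
#include <string>

// Number of code-units needed to store the same text in each UTF.
struct CodeUnitCounts
{
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// U+00E1 (á): outside ASCII, but inside the 16-bit range.
inline CodeUnitCounts CountsForAAcute()
{
    const std::string    utf8  = "\xC3\xA1";
    const std::u16string utf16 = u"\u00E1";
    const std::u32string utf32 = U"\u00E1";
    return { utf8.size(), utf16.size(), utf32.size() }; // 2, 1, 1
}

// U+1F600: beyond the 16-bit range.
inline CodeUnitCounts CountsForU1F600()
{
    const std::string    utf8  = "\xF0\x9F\x98\x80";
    const std::u16string utf16 = u"\U0001F600"; // stored as two code-units (a surrogate pair)
    const std::u32string utf32 = U"\U0001F600";
    return { utf8.size(), utf16.size(), utf32.size() }; // 4, 2, 1
}
```

In both cases there is exactly one code-point; only the number of code-units differs per encoding.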
Since there are many methods of encoding text, the question that should be asked when dealing with even the smallest amount of text is, in what encoding is this stored?
This is important because decoding a sequence of code-units with a different encoding than the one it was written in will lead to encoding errors (or worse, a valid decoding that yields the wrong content).
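The "valid decoding yielding the wrong content" case can be sketched as follows. The two decoders below are hypothetical helpers, not engine functions, and ISO-8859-1 (Latin-1) is chosen here only as an example of a single-byte code-page. The UTF-8 decoder assumes valid input and performs no error handling.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode bytes as ISO-8859-1: each byte maps to the code-point of the same value.
inline std::vector<std::uint32_t> DecodeLatin1(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    for (unsigned char b : bytes)
        out.push_back(b);
    return out;
}

// Decode bytes as UTF-8. Assumes the input is validly encoded.
inline std::vector<std::uint32_t> DecodeUtf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        std::uint32_t cp;
        int trailing;
        if (lead < 0x80)      { cp = lead;        trailing = 0; } // 1 code-unit
        else if (lead < 0xE0) { cp = lead & 0x1F; trailing = 1; } // 2 code-units
        else if (lead < 0xF0) { cp = lead & 0x0F; trailing = 2; } // 3 code-units
        else                  { cp = lead & 0x07; trailing = 3; } // 4 code-units
        for (; trailing > 0 && i < bytes.size(); --trailing)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i++]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}
```

For the bytes C3 A1, `DecodeUtf8` yields the single code-point U+00E1 (á), while `DecodeLatin1` yields U+00C3 and U+00A1 (Ã¡). Neither decoder reports an error, yet only one result matches what was written.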
An overview of some encodings and their properties:
encoding | code-unit size | code-point size | maps entire UCS space? | trivial to encode/decode? | immune to byte-order differences? | major users |
---|---|---|---|---|---|---|
ASCII | 7 bits | 1 byte | no | yes | yes | Many English-only apps |
(ANSI) code-page | 8 bits | varies, usually 1 byte | no | varies, usually yes | yes | Older OS functions |
UTF-8 | 8 bits | 1 to 4 bytes | yes | no | yes | Most things on the internet, XML |
UTF-16 | 16 bits | 2 or 4 bytes | yes | yes | no | Windows "wide" API, Qt |
UCS-2 | 16 bits | 2 bytes | no | yes | no | None (replaced with UTF-16) |
UTF-32 / UCS-4 | 32 bits | 4 bytes | yes | yes | no | Linux "wide" API |
As you can see, there is no single "best" encoding, so there is always the usage scenario to consider when picking an encoding.
Historically, this has led many operating systems and software packages to pick a (set of) supported encodings, and they have not all picked the same ones.
Even the C++ "wide character" type wchar_t follows different conventions on different platforms: for example, it is 16 bits wide on Windows but 32 bits wide on Linux.
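The platform difference can be checked at compile time. The sketch below is a minimal illustration with hypothetical names; it infers the conventional wide encoding purely from the size of wchar_t, which is an assumption that matches the Windows and Linux conventions described above but is not guaranteed by the C++ standard.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the platform this is compiled for.
constexpr std::size_t kWideCharBits = sizeof(wchar_t) * CHAR_BIT;

enum class WideEncoding { Utf16, Utf32, Unknown };

// The encoding that wide strings conventionally hold, judged only by the size of wchar_t.
constexpr WideEncoding DetectWideEncoding()
{
    return kWideCharBits == 16 ? WideEncoding::Utf16
         : kWideCharBits == 32 ? WideEncoding::Utf32
         : WideEncoding::Unknown;
}
```

Code that stores or transmits wchar_t data therefore cannot assume the same layout on every platform.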
Since CRYENGINE products can be used on many platforms and in many languages, we would like full UCS coverage.
Here are the conventions that Crytek is moving towards with CRYENGINE:
text data type | encoding | reason |
---|---|---|
source code | ASCII | We write our code in basic English, which means ASCII is sufficient. |
text assets | UTF-8 | Assets can be transferred between machines with potentially differing byte-order, and may contain text in many languages. |
run-time variables | UTF-8 | Since transforming text data from/to UTF-8 is not free, we keep the data in UTF-8 as much as possible. Exceptions must be made when interacting with libraries/OS that require another encoding, in which case all transformations should be done at the call-site. |
file and path names | ASCII | File names are a special case with regard to equality, most commonly case-(in)sensitivity, as defined by the file system. Unicode defines 3 cases, and conversions between them are locale-specific. In addition, the normalization formats are typically not (all) accounted for in file-systems and their APIs. Note: Some specialized file-systems (some console cloud storage DBs) only accept ASCII. Taken together, this means that the most basic and portable subset should be preferred; fall back to UTF-8 only as required. |
In general:
Since the vast majority of text uses single-byte code-units, this means a single-byte string type can be used pretty much everywhere.
In addition, since we do not deal with ANSI code-pages, we can be certain that all text is either ASCII or UTF-8.
The following properties hold for both ASCII and UTF-8
The difference between position and length in code-units (as reported through string::length(), strlen() etc.) and their matching position/length in code-points is largely irrelevant.
This is because the meaning of the sequence is typically abstract, and only when the text is interpreted (i.e., when displaying) does it matter what the bytes mean.
That said, there are some caveats:
In general, avoid interpreting text if at all possible. Otherwise, try to operate on the ASCII subset and treat all other text parts as opaque indivisible sequences.
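As an illustration of operating on the ASCII subset only, the sketch below splits text on an ASCII delimiter without decoding it. This is safe for UTF-8 because every code-unit of a multi-byte sequence has its high bit set, so a byte with an ASCII value can never occur in the middle of another code-point. `SplitOnAscii` is a hypothetical helper, not an engine function.

```cpp
#include <string>
#include <vector>

// Split UTF-8 (or ASCII) text on an ASCII delimiter.
// Everything between delimiters is copied as an opaque byte sequence.
inline std::vector<std::string> SplitOnAscii(const std::string& text, char delimiter)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;)
    {
        const std::string::size_type end = text.find(delimiter, start);
        if (end == std::string::npos)
        {
            parts.push_back(text.substr(start)); // last (possibly empty) part
            return parts;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
}
```

For example, splitting the UTF-8 bytes of "café/menü" on '/' yields the two parts intact, even though the function never looks at what the non-ASCII bytes mean.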
When dealing with the concept of "length" or "size", try to consider it in code-units instead of code-points, since those operations are cheaper.
(In fact, the concept of "length" of Unicode sequences is complex: there is a many-to-many mapping between code-points and what is actually displayed.)
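The cost difference can be sketched as follows, using two hypothetical helpers. The code-point count assumes valid UTF-8 and relies on the fact that continuation bytes always have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Length in code-units: constant time, this is what string::length() reports.
inline std::size_t CodeUnitLength(const std::string& utf8)
{
    return utf8.size();
}

// Length in code-points: linear time, every byte must be inspected.
// Counts each byte that is not a continuation byte (10xxxxxx).
inline std::size_t CodePointLength(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the letter a followed by the combining acute accent U+0301 (bytes 61 CC 81), this gives 3 code-units and 2 code-points, while the text is displayed as the single unit á. Neither number answers the question "how many characters does the user see".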
In general, always:
Starting with CRYENGINE 3.8.1, we have added some utilities to make it easy to losslessly and safely convert text between Unicode encodings.
In-depth technical details are provided in the header files that expose these utilities: UnicodeFunctions.h and UnicodeIterator.h.
The most common use-cases:
string utf8;
wstring wide;
Unicode::Convert(utf8, wide); // Convert contents of wide string and store into UTF-8 string
Unicode::Convert(wide, utf8); // Convert contents of UTF-8 string to wide string
string ascii;
Unicode::Convert<Unicode::eEncoding_ASCII, Unicode::eEncoding_UTF8>(ascii, utf8); // Convert UTF-8 to ASCII (lossy!)
Note that all of the above functions assume that the input text is already validly encoded.
To guard against user-input or potentially broken input, consider using the Unicode::ConvertSafe function instead.
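To show what "validly encoded" means for UTF-8, here is a standalone sketch of a validity check. It is not the engine's implementation and does not describe how Unicode::ConvertSafe works internally; `IsValidUtf8` is a hypothetical helper that only illustrates the kinds of problems untrusted input can contain.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if the bytes form a valid UTF-8 sequence.
inline bool IsValidUtf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        std::uint32_t cp = 0, minimum = 0;
        int trailing = 0;
        if (lead < 0x80)                { continue; } // ASCII
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; minimum = 0x10000; }
        else                            { return false; } // stray continuation or invalid lead byte

        for (; trailing > 0; --trailing)
        {
            if (i >= n) return false;                  // sequence is truncated
            const unsigned char next = static_cast<unsigned char>(bytes[i++]);
            if ((next & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (next & 0x3F);
        }
        if (cp < minimum) return false;                // overlong encoding
        if (cp > 0x10FFFF) return false;               // beyond the UCS range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate, not a valid code-point
    }
    return true;
}
```

Each `return false` corresponds to an encoding error as defined in the terminology table: a sequence of code-units that does not form a code-point, or that forms an invalid one.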
A good introduction can be found at http://www.joelonsoftware.com/articles/Unicode.html.
See http://unicode.org/ for the official standard and more details on Unicode.