unicode - Computer Definition
An international character set built to represent characters from the world's languages in a single 2-byte (16-bit) format. About 30,000 characters from languages around the globe have been assigned codes in a format agreed upon internationally.
The Java programming language and the Windows operating system use Unicode characters, storing them in memory as 16-bit values; in the C/C++ programming languages, a character is 8 bits. In Windows and Java, "utilizing Unicode" means using UTF-16 as the character-encoding standard, both to manipulate text in memory and to pass strings to APIs. Windows developers use the terms "Unicode string" and "wide string" (meaning "a string of 16-bit characters") interchangeably.
See Also: Bit and Bit Challenges; Programming Languages C, C++, Perl, and Java.
Orendorff, J. Unicode for Programmers (draft). [Online, March 1, 2002.] Orendorff Website. http://www.jorendorff.com/articles/unicode/index.html.
A character code that defines every character in most of the world's written languages. Although commonly thought to be only a two-byte coding system, Unicode uses from one to four bytes to hold a Unicode "code point" (see below). A code point is a unique number for a character or a symbol such as an accent mark or ligature. Unicode supports more than a million code points, which are written with a "U" followed by a plus sign and the number in hex; for example, the word "Hello" is written U+0048 U+0065 U+006C U+006C U+006F (see hex chart).

Character Encoding Schemes
There are several formats for storing Unicode code points. When combined with the byte order of the hardware (big endian or little endian), they are known officially as "character encoding schemes." They are also known by their UTF acronyms, which stand for "Unicode Transformation Format" or "Universal Character Set Transformation Format."

UTF-8, 16 and 32
The UTF-8 coding scheme is widely used because words from multiple languages and every type of symbol can be mixed together in the same message without having to reserve multiple bytes for every character, as in UTF-16 or UTF-32. With UTF-8, if only ASCII text is required, a single byte is used per character, with the high-order bit set to 0. For non-ASCII characters that require more than one byte, the high-order 1 bits of the first byte define how many bytes are used. See byte order, DBCS and emoji.

  Coding     ISO 10646    Number     Byte
  Scheme     Equivalent   of Bytes   Order**
  UTF-8                   1-4        BE or LE
  UTF-16     (UCS-2)      2          BE or LE
  UTF-16BE   (UCS-2)      2          BE
  UTF-16LE   (UCS-2)      2          LE
  UTF-32     (UCS-4)      4          BE or LE
  UTF-32BE   (UCS-4)      4          BE
  UTF-32LE   (UCS-4)      4          LE

  Pure ASCII (compatible with early 7-bit e-mail systems)
  UTF-7                   1-4        BE or LE

  **Byte Order (see byte order)   BE = big endian   LE = little endian
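The variable-length behavior of UTF-8 described above can be sketched in a few lines of Java (the class name is illustrative): ASCII characters encode to a single byte each, while characters with higher code points expand to two or three bytes, with the leading byte's high-order 1 bits indicating the sequence length.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // ASCII: one byte per character, high-order bit 0
        byte[] hello = "Hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(hello.length);  // 5 bytes: 48 65 6C 6C 6F

        // U+00E9 (e with acute accent): two bytes; leading byte
        // pattern 110xxxxx signals a 2-byte sequence
        byte[] eAcute = "\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(eAcute.length);  // 2 bytes: C3 A9

        // U+20AC (euro sign): three bytes; leading byte
        // pattern 1110xxxx signals a 3-byte sequence
        byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8);
        System.out.println(euro.length);  // 3 bytes: E2 82 AC
    }
}
```

Because the ASCII range is unchanged, a pure-ASCII file is already valid UTF-8, which is a large part of why UTF-8 dominates on the web and in e-mail.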