Universal Character Set-4 is a 31-bit encoding form defined by the original ISO 10646, and has largely been replaced by UTF-32. It can represent up to 2,147,483,648 characters from `0x00000000` to `0x7FFFFFFF`. Use this tag when you are specifically dealing with UCS-4.
Universal Character Set-4 (UCS-4) is a precursor of the UTF-32 Unicode encoding. It is a fixed-length character encoding scheme in which each character takes up 32 bits, or four bytes (hence the '4' in UCS-4).
The leading (sign) bit is unused, leaving 31 bits to encode each of the 2,147,483,648 possible characters, from `0x00000000` to `0x7FFFFFFF`.
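The 31-bit limit can be checked while reading raw bytes. The following is a minimal sketch, assuming big-endian input; the function name `decode_ucs4_be` and the constant `UCS4_MAX` are made up for illustration and are not part of any standard API.

```python
import struct

UCS4_MAX = 0x7FFFFFFF  # largest value that fits in 31 bits


def decode_ucs4_be(data: bytes) -> list[int]:
    """Split big-endian UCS-4 bytes into integer character values."""
    if len(data) % 4 != 0:
        raise ValueError("UCS-4 data must be a multiple of four bytes")
    values = []
    # ">I" reads one unsigned 32-bit big-endian integer per four bytes.
    for (value,) in struct.iter_unpack(">I", data):
        if value > UCS4_MAX:
            # The leading bit is unused in UCS-4, so it must be zero.
            raise ValueError(f"leading bit set in 0x{value:08X}")
        values.append(value)
    return values
```

For example, `decode_ucs4_be(bytes.fromhex("000000300000FFFD"))` yields the two values `0x30` and `0xFFFD`, while any four-byte group whose first bit is set is rejected.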
UCS-4 is now superseded by UTF-32, in which each of the 1,114,112 possible Unicode code points (17 planes of 65,536 code points each) takes up four bytes, and only code points `0x0000` to `0x10FFFF` are considered to be in range. Within that range, UTF-32 encodes characters almost identically to UCS-4. UCS-4 therefore covers all Unicode characters that can be encoded by a UTF format.
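The plane arithmetic and the range limit can be sketched in a few lines. This is an illustration only: the helper names `in_unicode_range` and `plane_of` are hypothetical, and the check covers just the upper bound described here.

```python
UNICODE_MAX = 0x10FFFF   # last code point in the Unicode range
UCS4_MAX = 0x7FFFFFFF    # last value representable in 31-bit UCS-4
PLANE_SIZE = 0x10000     # 65,536 code points per plane
PLANE_COUNT = 17


def in_unicode_range(value: int) -> bool:
    """True if a UCS-4 value is also within the Unicode range."""
    return 0 <= value <= UNICODE_MAX


def plane_of(value: int) -> int:
    """Return the plane number (0 to 16) of an in-range code point."""
    if not in_unicode_range(value):
        raise ValueError(f"0x{value:X} is outside the Unicode range")
    return value // PLANE_SIZE


# 17 planes of 65,536 code points give 1,114,112 code points in total.
assert PLANE_COUNT * PLANE_SIZE == UNICODE_MAX + 1 == 1_114_112
```

A value such as `0x7FFFFFFF` fits in UCS-4 but fails `in_unicode_range`, which is the practical difference between the two encodings.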
Examples of UCS-4 encodings (all of them big endian):
- Character `'0'` is stored as `0x00000030`, using four bytes, rather than one-byte `0x30` in ASCII or UTF-8, or two-byte `0x0030` in UTF-16.
- Replacement character `'�'` is stored as `0x0000FFFD`, again using four bytes, rather than three-byte `0xEF 0xBF 0xBD` in UTF-8 or two-byte `0xFFFD` in UTF-16.
- Emoji `'😆'` is stored as `0x0001F606`, again using four bytes; it needs neither the surrogate pair `0xD83D 0xDE06` used in UTF-16 nor the four-byte sequence `0xF0 0x9F 0x98 0x86` used in UTF-8.
- Code points above `0x10FFFF` are not in Unicode range and are not to be used.
Related Tags:
- utf-32, UCS-4's most direct successor
- utf-8, utf-16, other Unicode encodings
- unicode, ucs
- ucs2, where each of the 65,536 characters takes up two bytes
Read More: