Questions tagged [astral-plane]

Unicode characters beyond the 16-bit Basic Multilingual Plane (BMP), i.e. those that require surrogate pairs in languages whose native text encoding is UTF-16.
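
As a rough illustration of the tag description above, here is a minimal Python sketch of how a code point beyond the BMP maps to a UTF-16 surrogate pair. The helper name `to_surrogate_pair` is made up for this example and is not taken from any of the questions listed; the arithmetic follows the standard UTF-16 encoding rule.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


if __name__ == "__main__":
    kiss = "\U0001F618"  # FACE THROWING A KISS, an astral character
    high, low = to_surrogate_pair(ord(kiss))
    print(hex(high), hex(low))
    # Number of UTF-8 bytes and UTF-16 code units for the same character
    print(len(kiss.encode("utf-8")), len(kiss.encode("utf-16-le")) // 2)
```

The same character takes four bytes in UTF-8 and two 16-bit code units in UTF-16, which is why APIs that index strings by 16-bit unit can split such a character in half.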

43 questions
124
votes
3 answers

What are the most common non-BMP Unicode characters in actual use?

In your experience which Unicode characters, codepoints, ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16. I would've expected the answer to be…
hippietrail
  • 15,848
  • 18
  • 99
  • 158
41
votes
6 answers

JavaScript strings outside of the BMP

BMP being Basic Multilingual Plane. According to JavaScript: The Good Parts: "JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide." This leads me to believe that JavaScript uses…
Delan Azabani
  • 79,602
  • 28
  • 170
  • 210
27
votes
4 answers

What version of Unicode is supported by which .NET platform and on which version of Windows in regards to character classes?

Updated question ¹: With regard to character classes, comparison, sorting, normalization and collations, what Unicode version or versions are supported by which .NET platforms? Original question: I remember somewhat vaguely having read that .NET…
Abel
  • 56,041
  • 24
  • 146
  • 247
21
votes
6 answers

How would you get an array of Unicode code points from a .NET String?

I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16 and therefore some characters become wacky (surrogate) pairs instead. Thus when enumerating all the char's in a string, I don't…
Neil C. Obremski
  • 18,696
  • 24
  • 83
  • 112
19
votes
2 answers

Java regex match characters outside Basic Multilingual Plane

How can I match characters (with the intention of removing them) from outside the Unicode Basic Multilingual Plane in Java?
ʞɔıu
  • 47,148
  • 35
  • 106
  • 149
17
votes
4 answers

char to Unicode more than U+FFFF in java?

How can I display a Unicode Character above U+FFFF using char in Java? I need something like this (if it were valid): char u = '\u+10FFFF';
liuyuqing
  • 171
  • 1
  • 3
17
votes
4 answers

Java charAt used with characters that have two code units

From Core Java, vol. 1, 9th ed., p. 69: The character ℤ requires two code units in the UTF-16 encoding. Calling String sentence = "ℤ is the set of integers"; // for clarity; not in book char ch = sentence.charAt(1) doesn't return a space but the…
Patrick Brinich-Langlois
  • 1,381
  • 1
  • 15
  • 29
14
votes
4 answers

Unicode characters from charcode in javascript for charcodes > 0xFFFF

I need to get a string / char from a unicode charcode and finally put it into a DOM TextNode to add into an HTML page using client side JavaScript. Currently, I am doing: String.fromCharCode(parseInt(charcode, 16)); where charcode is a hex string…
leemes
  • 44,967
  • 21
  • 135
  • 183
9
votes
4 answers

In Windows, how do you enter a character outside of the Unicode Basic Multilingual Plane?

I know that Windows has supported supplemental planes since Windows XP. I have fonts which I know have characters outside the basic multilingual plane (BMP). For these characters, the Unicode codepoint consists of five hexadecimal digits. I do not…
yam655
  • 206
  • 2
  • 8
9
votes
2 answers

How to enter non-BMP unicode (hexadecimal with more than 4 characters) as input to Mathematica

Problem description: Mathematica uses "\:nnnn" as the syntax for Unicode input. E.g., if we enter "\:6c34", we get "水" ("water" in Chinese). But what if one wants to enter "\:1f618" (face throwing a kiss)? When I tried this, I got "ὡ8", not "a…
Ning
  • 2,850
  • 2
  • 16
  • 23
9
votes
1 answer

How are 4 bytes characters represented in C#

How are 4-byte chars represented in C#? As one char or as a set of 2 chars? var someCharacter = 'x'; //put 4 bytes UTF-16 character
SiberianGuy
  • 24,674
  • 56
  • 152
  • 266
9
votes
2 answers

how to render 32bit unicode characters in google v8 (and nodejs)

does anyone have an idea how to render unicode 'astral plane' characters (whose CIDs are beyond 0xffff) in google v8, the javascript vm that drives both google chrome and nodejs? funnily enough, when i give google chrome (it identifies as…
flow
  • 3,624
  • 36
  • 48
8
votes
2 answers

C# Regular Expressions with \Uxxxxxxxx characters in the pattern

Regex.IsMatch( "foo", "[\U00010000-\U0010FFFF]" ) Throws: System.ArgumentException: parsing "[-]" - [x-y] range in reverse order. Looking at the hex values for \U00010000 and \U0010FFFF I get: 0xd800 0xdc00 for the first character and 0xdbff…
Ben McNiel
  • 8,661
  • 10
  • 36
  • 38
8
votes
2 answers

Unicode Supplementary Multilingual Plane in Java

I want to work with the SMP (Supplementary Multilingual Plane) in Java. Actually, I want to print a character whose code point is greater than 0xFFFF. I used this line of code: int hexCodePoint = Character.toCodePoint('\uD801', '\uDC02' ); to have the…
Shadi
  • 93
  • 1
  • 6
8
votes
1 answer

Can MongoDB store and manipulate strings of UTF-8 with code points outside the basic multilingual plane?

In MongoDB 2.0.6, when attempting to store or query documents that contain string fields whose values include characters outside the BMP, I get a raft of errors like: "Not proper UTF-16: 55357", or "buffer too small" What…
Eli
  • 227
  • 1
  • 3
  • 11