Questions tagged [surrogate-pairs]

Unicode characters with code points above 0xFFFF are encoded in UTF-16 by pairs of 16-bit code units called **surrogate pairs**.

Unicode characters outside the Basic Multilingual Plane, that is, characters with code points above 0xFFFF, are encoded in UTF-16 as pairs of 16-bit code units called surrogate pairs, using the following scheme:

  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
  • the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;
  • the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
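
The scheme above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function names `to_surrogate_pair` and `from_surrogate_pair` are made up for this example, and the code works on plain integers rather than on strings.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper written for this example only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in the range 0x10000..0x10FFFF")
    offset = code_point - 0x10000       # 20-bit number in 0..0x0FFFFF
    high = 0xD800 + (offset >> 10)      # top ten bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)     # low ten bits -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate must be in the range 0xD800..0xDBFF")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate must be in the range 0xDC00..0xDFFF")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, `to_surrogate_pair(0x1F64F)` returns `(0xD83D, 0xDE4F)`, which is the pair written as `\ud83d\ude4f` in JSON and in Java or JavaScript string escapes.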
111 questions
187
votes
7 answers

What is a "surrogate pair" in Java?

I was reading the documentation for StringBuffer, in particular the reverse() method. That documentation mentions something about surrogate pairs. What is a surrogate pair in this context? And what are low and high surrogates?
Raymond
  • 2,004
  • 2
  • 13
  • 10
124
votes
3 answers

What are the most common non-BMP Unicode characters in actual use?

In your experience which Unicode characters, codepoints, ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16. I would've expected the answer to be…
hippietrail
  • 15,848
  • 18
  • 99
  • 158
52
votes
2 answers

How can I convert surrogate pairs to normal string in Python?

This is a follow-up to How can I convert JSON-encoded data that contains Unicode surrogate pairs to string?. In that question, the OP had a json.dumps()-encoded file with an emoji represented as a surrogate pair - \ud83d\ude4f. They were having…
MattDMo
  • 100,794
  • 21
  • 241
  • 231
43
votes
2 answers

How to use unicode in Android resource?

I want to use this unicode character in my resource file. But whatever I do, I end with dalvikvm crash (tested with Android 2.3 and 4.2.2): W/dalvikvm( 8797): JNI WARNING: input is not valid Modified UTF-8: illegal start byte 0xf0 W/dalvikvm( 8797):…
Pitel
  • 5,334
  • 7
  • 45
  • 72
41
votes
6 answers

JavaScript strings outside of the BMP

BMP being Basic Multilingual Plane. According to JavaScript: The Good Parts: JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide. This leads me to believe that JavaScript uses…
Delan Azabani
  • 79,602
  • 28
  • 170
  • 210
32
votes
4 answers

Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters")

Splitting a JavaScript string into "characters" can be done trivially but there are problems if you care about Unicode (and you should care about Unicode). JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16) but this does not…
hippietrail
  • 15,848
  • 18
  • 99
  • 158
17
votes
4 answers

Java charAt used with characters that have two code units

From Core Java, vol. 1, 9th ed., p. 69: The character ℤ requires two code units in the UTF-16 encoding. Calling String sentence = "ℤ is the set of integers"; // for clarity; not in book char ch = sentence.charAt(1) doesn't return a space but the…
Patrick Brinich-Langlois
  • 1,381
  • 1
  • 15
  • 29
15
votes
7 answers

Why UTF-32 instead of UTF-16 if we have surrogate pairs?

If I understand correctly, UTF-32 can handle every character in the universe. So can UTF-16, through the use of surrogate pairs. So is there any good reason to use UTF-32 instead of UTF-16?
zildjohn01
  • 11,339
  • 6
  • 52
  • 58
15
votes
2 answers

How do I create a string with a surrogate pair inside of it?

I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair…
michael
  • 14,844
  • 28
  • 89
  • 177
14
votes
3 answers

Python: getting correct string length when it contains surrogate pairs

Consider the following exchange on IPython: In [1]: s = u'華袞與緼同歸' In [2]: len(s) Out[2]: 8 The correct output should have been 7, but because the fifth of these seven Chinese characters has a high Unicode code-point, it is represented in UTF-8 by…
brannerchinese
  • 1,909
  • 5
  • 24
  • 40
14
votes
5 answers

How to remove surrogate characters in Java?

I am facing a situation where I get surrogate characters in text that I am saving to MySQL 5.1. As UTF-16 is not supported there, I want to remove these surrogate pairs manually with a Java method before saving the text to the database. I have…
Slowcoder
  • 2,060
  • 3
  • 16
  • 21
12
votes
2 answers

Python: Find equivalent surrogate pair from non-BMP unicode char

The answer presented here: How to work with surrogate pairs in Python? tells you how to convert a surrogate pair, such as '\ud83d\ude4f' into a single non-BMP unicode character (the answer being "\ud83d\ude4f".encode('utf-16',…
hilssu
  • 416
  • 4
  • 18
12
votes
4 answers

Java Can't Open a File with Surrogate Unicode Values in the Filename?

I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM…
Bear
  • 121
  • 1
  • 1
  • 3
11
votes
2 answers

Detecting and Retrieving codepoints and surrogates from a Delphi String

I am trying to better understand surrogate pairs and Unicode implementation in Delphi. If I call length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I will get back 8. This is because the lengths of the individual characters…
sse
  • 987
  • 1
  • 11
  • 30
10
votes
3 answers

Issue with surrogate unicode characters in F#

I'm working with strings, which could contain surrogate unicode characters (non-BMP, 4 bytes per character). When I use "\Uxxxxxxxxv" format to specify surrogate character in F# - for some characters it gives different result than in the case of…
Vitaliy
  • 2,744
  • 1
  • 24
  • 39