Questions tagged [surrogate-pairs]

Unicode characters with code above 0xFFFF, are encoded in UTF-16 by pairs of 16-bit code units called **surrogate pairs**.

Unicode characters outside the Basic Multilingual Plane, that is characters with code above 0xFFFF, are encoded in UTF-16 by pairs of 16-bit code units called surrogate pairs, by the following scheme:

0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;
the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.

111 questions

187

votes

7 answers

What is a "surrogate pair" in Java?

I was reading the documentation for StringBuffer, in particular the reverse() method. That documentation mentions something about surrogate pairs. What is a surrogate pair in this context? And what are low and high surrogates?

asked May 05 '11 at 19:21

Raymond

2,004
2
13
10

124

votes

3 answers

What are the most common non-BMP Unicode characters in actual use?

In your experience which Unicode characters, codepoints, ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16. I would've expected the answer to be…

unicode cjk codepoint surrogate-pairs astral-plane

asked Apr 06 '11 at 13:36

hippietrail

15,848
18
99
158

votes

2 answers

How can I convert surrogate pairs to normal string in Python?

This is a follow-up to How can I convert JSON-encoded data that contains Unicode surrogate pairs to string?. In that question, the OP had a json.dumps()-encoded file with an emoji represented as a surrogate pair - \ud83d\ude4f. They were having…

python python-3.x unicode surrogate-pairs

asked Jul 01 '16 at 13:55

MattDMo

100,794
21
241
231

votes

2 answers

How to use unicode in Android resource?

I want to use this unicode character in my resource file. But whatever I do, I end with dalvikvm crash (tested with Android 2.3 and 4.2.2): W/dalvikvm( 8797): JNI WARNING: input is not valid Modified UTF-8: illegal start byte 0xf0 W/dalvikvm( 8797):…

android unicode utf-8 surrogate-pairs

asked May 28 '13 at 07:55

Pitel

5,334
7
45
72

votes

6 answers

JavaScript strings outside of the BMP

BMP being Basic Multilingual Plane According to JavaScript: the Good Parts: JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide. This leads me to believe that JavaScript uses…

javascript unicode utf-16 surrogate-pairs astral-plane

asked Sep 19 '10 at 06:17

Delan Azabani

79,602
28
170
210

votes

4 answers

Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters")

Splitting a JavaScript string into "characters" can be done trivially but there are problems if you care about Unicode (and you should care about Unicode). JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16) but this does not…

javascript string unicode codepoint surrogate-pairs

asked Jan 28 '14 at 05:09

hippietrail

15,848
18
99
158

votes

4 answers

Java charAt used with characters that have two code units

From Core Java, vol. 1, 9th ed., p. 69: The character ℤ requires two code units in the UTF-16 encoding. Calling String sentence = "ℤ is the set of integers"; // for clarity; not in book char ch = sentence.charAt(1) doesn't return a space but the…

java unicode utf-16 surrogate-pairs astral-plane

asked Jan 04 '13 at 03:05

Patrick Brinich-Langlois

1,381
1
15
29

votes

7 answers

Why UTF-32 instead of UTF-16 if we have surrogate pairs?

If I understand correctly, UTF-32 can handle every character in the universe. So can UTF-16, through the use of surrogate pairs. So is there any good reason to use UTF-32 instead of UTF-16?

unicode surrogate-pairs

asked Mar 09 '09 at 04:31

zildjohn01

11,339
6
52
58

votes

2 answers

How do I create a string with a surrogate pair inside of it?

I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair…

c# string utf-16 utf-32 surrogate-pairs

asked Jan 15 '13 at 22:06

michael

14,844
28
89
177

votes

3 answers

Python: getting correct string length when it contains surrogate pairs

Consider the following exchange on IPython: In [1]: s = u'華袞與緼同歸' In [2]: len(s) Out[2]: 8 The correct output should have been 7, but because the fifth of these seven Chinese characters has a high Unicode code-point, it is represented in UTF-8 by…

python surrogate-pairs

asked Oct 16 '12 at 03:14

brannerchinese

1,909
5
24
40

votes

5 answers

How to remove surrogate characters in Java?

I am facing a situation where i get Surrogate characters in text that i am saving to MySql 5.1. As the UTF-16 is not supported in this, I want to remove these surrogate pairs manually by a java method before saving it to the database. I have…

java string surrogate-pairs

asked Oct 12 '12 at 20:57

Slowcoder

2,060
3
16
21

votes

2 answers

Python: Find equivalent surrogate pair from non-BMP unicode char

The answer presented here: How to work with surrogate pairs in Python? tells you how to convert a surrogate pair, such as '\ud83d\ude4f' into a single non-BMP unicode character (the answer being "\ud83d\ude4f".encode('utf-16',…

python unicode encoding emoji surrogate-pairs

asked Oct 24 '16 at 16:13

hilssu

votes

4 answers

Java Can't Open a File with Surrogate Unicode Values in the Filename?

I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM…

java file unicode filenames surrogate-pairs

asked Oct 09 '09 at 19:21

Bear

votes

2 answers

Detecting and Retrieving codepoints and surrogates from a Delphi String

I am trying to better understand surrogate pairs and Unicode implementation in Delphi. If I call length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I will get back, 8. This is because the lengths of the individual characters…

delphi unicode surrogate-pairs

asked Aug 14 '15 at 23:47

sse

votes

3 answers

Issue with surrogate unicode characters in F#

I'm working with strings, which could contain surrogate unicode characters (non-BMP, 4 bytes per character). When I use "\Uxxxxxxxxv" format to specify surrogate character in F# - for some characters it gives different result than in the case of…

c# unicode f# surrogate-pairs

asked Apr 12 '12 at 12:58

Vitaliy

2,744
1
24
39

2 3 4 5 6 7 8 Next