10

I came across this code in a javascript open source project.

validator.isLength = function (str, min, max) {
    // match surrogate pairs in string or declare an empty array if none found in string
    var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
    // subtract the surrogate pairs string length from main string length
    var len = str.length - surrogatePairs.length;
    // now compare string length with min and max ... also make sure max is defined(in other words, max param is optional for function)
    return len >= min && (typeof max === 'undefined' || len <= max);
};

As far as I understand, the above code is checking the length of the string but not taking the surrogate pairs into account. So:

  1. Is my understanding of the code correct?

  2. What are surrogate pairs?

I have thus far only figured out that this is related to encoding.

Noman Ur Rehman
  • Surrogate pair is a Unicode term, which is completely unrelated to Javascript. Read this [Joel on Software's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html), [Unicode, UTF-8 and character encodings: What every developer should know](http://www.teknically-speaking.com/2014/02/unicode-utf-8-and-character-encodings_23.html) – phuclv Aug 13 '15 at 11:33
  • [Difference between composite characters and surrogate pairs](http://stackoverflow.com/q/22121184/995714) – phuclv Aug 13 '15 at 11:36

4 Answers

22
  1. Yes, your understanding is correct. The function checks whether the length of the string, counted in Unicode code points rather than UTF-16 code units, lies between min and max.

  2. JavaScript uses UTF-16 to encode its strings. This means each code unit is two bytes (16 bits) wide, and a single code unit can represent only the code points up to U+FFFF.

    Some characters in Unicode (emoji, for example) have code points too high to fit in two bytes (16 bits), so they are encoded as two UTF-16 code units (4 bytes). Such a pair of code units is called a surrogate pair.

Try this

var len = "😀".length; // There is an emoji in the string (if you don’t see it)

vs

var str = "😀";
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
var len = str.length - surrogatePairs.length;

In the first example len will be 2 because the emoji consists of two UTF-16 code units. In the second example len will be 1.
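For comparison, here is a minimal sketch that contrasts the code unit count with two ways of counting code points. The helper names `codePointLength` and `regexLength` are made up for this illustration and are not part of the validator library; the first one assumes an ES2015+ environment, where iterating a string walks code points instead of code units.

```javascript
// Hypothetical helper: count code points by iterating the string.
// Array.from uses the string iterator, which yields one entry per code point.
function codePointLength(str) {
  return Array.from(str).length;
}

// Hypothetical helper: the regex approach from the question.
function regexLength(str) {
  var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
  return str.length - surrogatePairs.length;
}

// "a", then U+1F600 (an emoji) written as its surrogate pair, then "b".
var sample = "a\uD83D\uDE00b";

console.log(sample.length);           // 4 (UTF-16 code units)
console.log(codePointLength(sample)); // 3 (code points)
console.log(regexLength(sample));     // 3 (code points)
```

Both approaches agree here; the escape sequences are used only so the example does not depend on the emoji surviving copy and paste.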

You might want to read [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) by Joel Spolsky.

idmean
4

For your second question: 1. What is a "surrogate pair" in Java? The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.

In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.

Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.

The surrogate code units fall into two ranges: "high surrogates" (0xD800 to 0xDBFF), which are allowed only at the start of the two code unit sequence, and "low surrogates" (0xDC00 to 0xDFFF), which are allowed only at the end.
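The same ranges apply to JavaScript strings, since they are also UTF-16. As a rough sketch, the two code units of an emoji can be inspected like this; `isHighSurrogate` and `isLowSurrogate` are hypothetical helper names chosen for the example, and `codePointAt` assumes ES2015+.

```javascript
// Hypothetical helpers: classify a single UTF-16 code unit.
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// U+1F600 (an emoji), written as its two UTF-16 code units.
var emoji = "\uD83D\uDE00";

var first = emoji.charCodeAt(0);  // first code unit of the pair
var second = emoji.charCodeAt(1); // second code unit of the pair

console.log(first.toString(16), isHighSurrogate(first));  // d83d true
console.log(second.toString(16), isLowSurrogate(second)); // de00 true

// codePointAt reads the whole pair and returns the original code point.
console.log(emoji.codePointAt(0).toString(16)); // 1f600
```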

  1. https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396

Hope this helps.

Community
Eran Yogev
0

In Unicode, every character (including emoji) has a unique number, its code point. UTF-16 encodes these numbers in 16-bit chunks, and 16 bits give 65,536 different combinations.

What if a character has the code point 70,000? UTF-16 has an algorithm for this situation. First it subtracts 65,536 from 70,000, which gives 4,464. Then it converts this number to binary and pads it with zeros on the left until there are 20 digits:

00000001000101110000

Then it splits this number into two 10-bit halves: 0000000100 (4) and 0101110000 (368). It adds the first half to 55,296 and the second half to 56,320, which gives 55,300 and 56,688 in decimal. These numbers are the high surrogate and the low surrogate respectively. UTF-16 then stores these two numbers (both lower than 2^16) in two successive 16-bit chunks. And here is the important point: the values from 55,296 up to (but not including) 55,296 + 2^10, and from 56,320 up to (but not including) 56,320 + 2^10, do not represent any character on their own. Because of this, when a UTF-16 decoder encounters a number in the first interval, it knows that a second chunk follows and that this is only half of a character.
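The standard UTF-16 computation for code point 70,000 can be sketched in JavaScript as follows. The function name `toSurrogatePair` is made up for this example; the constants are the surrogate bases 0xD800 (55,296) and 0xDC00 (56,320), and `String.fromCodePoint` assumes ES2015+.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its
// high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point does not need a surrogate pair");
  }
  var offset = codePoint - 0x10000;    // subtract 65,536, leaving a 20-bit value
  var high = 0xD800 + (offset >> 10);  // top 10 bits, added to 55,296
  var low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits, added to 56,320
  return [high, low];
}

var pair = toSurrogatePair(70000);
console.log(pair.join(", ")); // 55300, 56688

// The two code units form the same string as the code point itself.
console.log(
  String.fromCharCode(pair[0], pair[1]) === String.fromCodePoint(70000)
); // true
```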

Hope this helps.

batunal
-4

Did you try to just google it?

The best description is http://unicodebook.readthedocs.io/unicode_encodings.html#surrogates

In UTF-16 some characters are stored in 8 bits and others in 16 bits.

A surrogate pair is a character representation that takes 16 bits. Some character codes are reserved to be the first one in such pairs.

0xced
suvroc
  • Reread your linked reference. "In UTF-16, characters in ranges U+0000—U+D7FF and U+E000—U+FFFD are stored as a single 16 bits unit. Non-BMP characters (range U+10000—U+10FFFF) are stored as “surrogate pairs”, two 16 bits units..." So a character in UTF-16 will be either 16 bits or, if a surrogate pair is needed, 32 bits. No characters are stored as 8 bits. – Steven Rumbalski Apr 19 '18 at 14:49
  • I've upvoted this answer because this reference gives the best description of how surrogate pairs work in UTF-16. In a nutshell, U+0000-U+D7FF and U+E000-U+FFFD are encoded as 2 bytes; anything over U+10000 is encoded as a "surrogate pair" of two characters drawn from the gap in this range, U+D800-U+DFFF. A lone surrogate character is invalid, which is why the above regular expression in https://stackoverflow.com/a/31986749/1799811 works (I've omitted a small amount of complexity, for brevity) – kierantop Oct 02 '18 at 12:25