Issue with surrogate unicode characters in F#

Question

I'm working with strings that may contain surrogate Unicode characters (non-BMP, 4 bytes per character).

When I use the "\Uxxxxxxxx" format to specify a surrogate character in F#, some characters give a different result than in C#. For example:

C#:

string s = "\U0001D11E";
bool c = Char.IsSurrogate(s, 0);
Console.WriteLine(String.Format("Length: {0}, is surrogate: {1}", s.Length, c));

Gives: Length: 2, is surrogate: True

F#:

let s = "\U0001D11E"
let c = Char.IsSurrogate(s, 0)
printf "Length: %d, is surrogate: %b" s.Length c

Gives: Length: 2, is surrogate: false

Note: Some surrogate characters work in F# ("\U0010011", "\U00100011"), but some of them don't.

Q: Is this a bug in F#? How can I handle valid surrogate Unicode characters in strings with F#? (Does F# have a different format, or is the only way to use Char.ConvertFromUtf32 0x1D11E?)

Update:
s.ToCharArray() gives for F# [| 0xD800; 0xDF41 |]; for C# { 0xD834, 0xDD1E }

These are framework methods so don't differ between C# and F#. Quacks like a compiler bug handling the string literal. Document what you get out of s.ToCharArray(). — Hans Passant, Apr 12 '12 at 13:12
1) Char.IsSurrogate has 2 signatures - second allows to use string and position; 2) *let s = '\U0001D11E'* results in compiler error — Vitaliy, Apr 12 '12 at 13:20

score 8 · Answer 1 · answered Apr 12 '12 at 22:35

This is a known bug in the F# compiler that shipped with VS2010 (and SP1); the fix appears in the VS11 bits, so if you have the VS11 Beta and use the F# 3.0 compiler, you'll see this behave as expected.

(If the other answers/comments here don't provide you with a suitable workaround in the meantime, let me know.)

score 5 · Accepted Answer · answered Apr 12 '12 at 13:34

That obviously means F# makes a mistake while parsing some string literals. The evidence is that the character you mention is non-BMP, so in UTF-16 it should be represented as a pair of surrogates. Surrogates are 16-bit code units in the range 0xD800-0xDFFF, while neither char in the produced string falls in that range.

But the processing of surrogates doesn't change, since the framework under the hood is the same. So you already have the answer in your question: if you need strings with non-BMP characters in your code, just use Char.ConvertFromUtf32 instead of the \UXXXXXXXX notation. All the rest of the processing will be the same as always.
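
A minimal sketch of that workaround, assuming the code point U+1D11E from the question. The names `codePoint`, `offset`, `expectedHigh` and `expectedLow` are illustrative only; the hand-computed pair is there just to cross-check what Char.ConvertFromUtf32 returns against the standard UTF-16 surrogate formula.

```fsharp
open System

// U+1D11E, the non-BMP code point used in the question.
let codePoint = 0x1D11E

// Build the string at run time instead of relying on the "\U0001D11E" literal.
let s = Char.ConvertFromUtf32 codePoint

// Hand-computed UTF-16 surrogate pair, for comparison only.
let offset = codePoint - 0x10000
let expectedHigh = char (0xD800 + (offset >>> 10))    // 0xD834
let expectedLow  = char (0xDC00 + (offset &&& 0x3FF)) // 0xDD1E

printfn "Length: %d, is surrogate: %b" s.Length (Char.IsSurrogate(s, 0))
printfn "Code units: 0x%04X 0x%04X" (int s.[0]) (int s.[1])
printfn "Matches hand-computed pair: %b" (s.[0] = expectedHigh && s.[1] = expectedLow)
printfn "Round trip: 0x%X" (Char.ConvertToUtf32(s, 0))
```

Because the string is produced by a method call, it cannot be used where a compile-time constant is required.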

answered Apr 12 '12 at 13:34

Andriy K


Thanks, and yes, Char.ConvertFromUtf32 could be used as a solution in some cases, but it certainly has a limitation (I cannot declare characters that way in constants) – Vitaliy Apr 12 '12 at 13:40
You can hack constants like this: ``\uD834\uDD1E``. It's not very readable, and it's probably better to add a comment describing what it is, but it's still better than nothing. – Andriy K Apr 12 '12 at 13:44

score 1 · Answer 3 · answered Apr 12 '12 at 13:23

It seems to me that this is connected with different forms of normalization. In both C# and F#, s.IsNormalized() returns true. But in C#

s.ToCharArray() gives us {55348, 56606} //0xD834, 0xDD1E

and in F#

s.ToCharArray() gives us {65533, 57422} //0xFFFD, 0xE04E

And as you probably know System.Char.IsSurrogate is implemented in the following way:

   public static bool IsSurrogate(char c)
   { 
        return (c >= HIGH_SURROGATE_START && c <= LOW_SURROGATE_END); 
   }

where

   HIGH_SURROGATE_START = 0x00d800; 
   LOW_SURROGATE_END    = 0x00dfff;

So in C# the first char (55348) lies between HIGH_SURROGATE_START and LOW_SURROGATE_END, but in F# the first char (65533) is greater than LOW_SURROGATE_END.
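
The range check can be reproduced with a short sketch that feeds Char.IsSurrogate the code units quoted in this answer. The arrays `fromCSharp` and `fromFSharp` and the helper `surrogateFlags` are hypothetical names; the values are simply copied from the output above, not produced by the affected compiler.

```fsharp
open System

// Code units reported above for the same "\U0001D11E" literal.
let fromCSharp = [| char 0xD834; char 0xDD1E |]
let fromFSharp = [| char 0xFFFD; char 0xE04E |]

// Hypothetical helper: one surrogate flag per code unit.
let surrogateFlags (chars: char[]) =
    chars |> Array.map (fun c -> Char.IsSurrogate(c))

// Print each code unit together with its flag.
let describe (label: string) (chars: char[]) =
    Array.zip chars (surrogateFlags chars)
    |> Array.iter (fun (c, flag) ->
        printfn "%s: 0x%04X surrogate=%b" label (int c) flag)

describe "C#" fromCSharp
describe "F#" fromFSharp
```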

I hope this helps.

answered Apr 12 '12 at 13:23

VMykyt


Thank you for the problem description. So you think the problem is a different normalization used in F#. OK, but how can I add a surrogate character to a string in F# if *"\U0001D11E"* doesn't work for me? – Vitaliy Apr 12 '12 at 13:29
I don't think this problem has anything to do with normalization. Actually, a string like this should just be parsed and presented as is, and that's definitely not what happens. – Andriy K Apr 12 '12 at 13:54
