Surrogate Pair Detection Fails

Question

I'm working on a small side project that involves porting existing C# code to F#, and I seem to have come across a difference in how regular expressions are handled between the two languages. (I'm posting this hoping to find out that I'm just doing something wrong.)

This minor function simply detects surrogate pairs using the regular expression trick outlined here. Here's the current implementation:

let isSurrogatePair input =
    Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]")

If I then execute it against a known surrogate pair like this:

let result = isSurrogatePair "野"
printfn "%b" result

I get false in the FSI window.

If I use the equivalent C#:

public bool IsSurrogatePair(string input)
{
    return Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]");
}

And the same input value, I (correctly) get true back.

Is this a real issue, or am I simply doing something wrong in my F# implementation?

Fyodor Soikin · Accepted Answer · 2018-07-26T15:12:56.507

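There appears to be a bug in how F# encodes escaped Unicode characters in string literals.
Here is the output from F# Interactive (note the last two results: 65533 is 0xFFFD, the Unicode replacement character, rather than the expected 55296 and 55552):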

> "\uD500".[0] |> uint16 ;;
val it : uint16 = 54528us
> "\uD700".[0] |> uint16 ;;
val it : uint16 = 55040us
> "\uD800".[0] |> uint16 ;;
val it : uint16 = 65533us
> "\uD900".[0] |> uint16 ;;
val it : uint16 = 65533us

Fortunately, this workaround works:

> let s = new System.String( [| char 0xD800 |] )
s.[0] |> uint16
;;

val s : System.String = "�"
val it : uint16 = 55296us

Based on that finding, I can construct a corrected (or rather, worked-around) version of isSurrogatePair:

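open System.Text.RegularExpressions

let isSurrogatePair input =
  // Build each range boundary from its numeric code so no "\uD800"-style literal is compiled
  let chrToStr code = new System.String( [| char code |] )
  let regex = "[" + (chrToStr 0xD800) + "-" + (chrToStr 0xDBFF) + "][" + (chrToStr 0xDC00) + "-" + (chrToStr 0xDFFF) + "]"
  Regex.IsMatch(input, regex)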

This version correctly returns true for your input.

I have just filed this issue on GitHub: https://github.com/Microsoft/visualfsharp/issues/338

For posterity: recent versions of F# have this resolved, literals do not exhibit this encoding problem anymore. — Abel, Jul 26 '18 at 14:42

score 3 · Answer 2 · answered Apr 01 '15 at 04:43

This does seem to be a legitimate F# bug, no argument there. I just wanted to suggest some alternative workarounds.

Don't embed the problem characters in the string itself; specify them using the regex engine's own Unicode escape support. The regex pattern to match the character with hexadecimal code XXXX is \uXXXX, so just escape your backslashes or use a verbatim string, and the escape reaches the regex engine untouched:

Regex.IsMatch(input, "[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]")
// or
Regex.IsMatch(input, @"[\uD800-\uDBFF][\uDC00-\uDFFF]")
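
To make the difference concrete, here is a minimal sketch of why the verbatim form sidesteps the compiler. It assumes a test character of U+20BB7 (chosen only for illustration) and builds it with Char.ConvertFromUtf32 so that the sample itself does not depend on string-literal escapes. The names pattern and sample are hypothetical.

```fsharp
open System
open System.Text.RegularExpressions

// Verbatim string: the six characters \ u D 8 0 0 are stored as-is,
// so the regex engine (not the F# compiler) interprets the escape.
let pattern = @"[\uD800-\uDBFF][\uDC00-\uDFFF]"

// Assumed sample: U+20BB7, a supplementary character that needs a surrogate pair.
let sample = Char.ConvertFromUtf32 0x20BB7

printfn "%d" (@"\uD800").Length                 // 6
printfn "%d" sample.Length                      // 2
printfn "%b" (Regex.IsMatch(sample, pattern))   // true
printfn "%b" (Regex.IsMatch("abc", pattern))    // false
```

Because the pattern string never contains an actual surrogate character, it should behave the same whether or not the compiler's literal handling is affected.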

Use built-in regex support for Unicode blocks:

// high surrogate followed by low surrogate
Regex.IsMatch(input, @"(\p{IsHighSurrogates}|\p{IsHighPrivateUseSurrogates})\p{IsLowSurrogates}")

or for Unicode general categories:

// 2 characters, each of which is half of a surrogate pair
// (can give a false positive if both are, e.g., low surrogates)
Regex.IsMatch(input, @"\p{Cs}{2}")
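
The false-positive caveat can be checked with a short sketch. It assumes two hand-built inputs: a string of two low surrogates (not a valid pair) and the pair for U+20BB7. The names twoLows, realPair, byCategory and byBlocks are hypothetical, introduced only for this comparison.

```fsharp
open System
open System.Text.RegularExpressions

// Two low surrogates in a row: not a valid pair. Built from numeric codes.
let twoLows = new System.String([| char 0xDC00; char 0xDC00 |])
// A genuine pair (high surrogate followed by low surrogate) for U+20BB7.
let realPair = Char.ConvertFromUtf32 0x20BB7

let byCategory = @"\p{Cs}{2}"
let byBlocks = @"(\p{IsHighSurrogates}|\p{IsHighPrivateUseSurrogates})\p{IsLowSurrogates}"

printfn "%b" (Regex.IsMatch(twoLows, byCategory))   // true (the false positive)
printfn "%b" (Regex.IsMatch(twoLows, byBlocks))     // false
printfn "%b" (Regex.IsMatch(realPair, byCategory))  // true
printfn "%b" (Regex.IsMatch(realPair, byBlocks))    // true
```

If malformed input is possible, the block-based pattern is the stricter of the two, since it requires a high surrogate before the low one.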
