I am trying to figure out the best way to create a function that is equivalent to String.Replace("oldValue","newValue"); that can handle surrogate pairs.

My concern is that if the string contains surrogate pairs and the search string could match only part of a pair, the replacement might split the pair and corrupt the data.

So my high level question is: Is String.Replace(string oldValue, string newValue); a safe operation when it comes to Unicode and surrogate pairs?

If not, what would be the best path forward? I am familiar with the StringInfo class that can split these strings into elements and such. I'm just unsure of how to go about the replace when passing in strings for the old and new values.

Thanks for the help!

Ibrennan208
  • Have you tried testing any strings yourself? – Michael Gunter May 04 '18 at 18:23
  • I have tried other functions and know for sure things like indices, substrings, and reversals can be corrupted. I have also read through the source code for the string replace and it seems to be dealing with chars so it seems unsafe. Unfortunately I don't know many surrogate pairs that could potentially overlap other characters because I am unfamiliar with all of them. I have been trying to find characters where this issue would arise, but figured I could post the question while researching. – Ibrennan208 May 04 '18 at 18:36

1 Answer

It's safe, because strings in .NET are internally UTF-16. A Unicode code point is represented by one or two UTF-16 code units, and a .NET char is one such code unit.

When a code point is represented by two units, the first unit is called the high surrogate and the second the low surrogate. What matters for this question is that surrogate units belong to a reserved range, U+D800 to U+DFFF (high surrogates in U+D800 to U+DBFF, low surrogates in U+DC00 to U+DFFF). This range is used only to represent surrogate pairs; a single unit from this range has no meaning on its own and is invalid.
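As a minimal sketch of the ranges described above, the following prints the two code units of one code point outside the Basic Multilingual Plane. The class name `SurrogateDemo`, the helper `ToHexUnits`, and the choice of U+1F600 are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Linq;

public static class SurrogateDemo
{
    // Hypothetical helper: formats each UTF-16 code unit as four hex digits.
    public static string ToHexUnits(string s)
    {
        return string.Join(" ", s.Select(c => ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // U+1F600 is above U+FFFF, so it needs two UTF-16 code units.
        string s = char.ConvertFromUtf32(0x1F600);

        Console.WriteLine(s.Length);                   // 2
        Console.WriteLine(ToHexUnits(s));              // D83D DE00
        Console.WriteLine(char.IsHighSurrogate(s[0])); // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));  // True
    }
}
```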

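For that reason, a valid UTF-16 string cannot match only "part" of a surrogate pair in another valid UTF-16 string: a valid search string never starts with a low surrogate or ends with a high surrogate, so every match begins and ends on a code point boundary.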

Note that a .NET string can also hold invalid UTF-16. If any argument to Replace is invalid, then it can indeed split a surrogate pair. That is garbage in, garbage out, though, so I don't consider it a problem in this case.
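A small sketch of both cases, assuming the ordinal `Replace(string, string)` overload. The class `ReplaceDemo` and the helper `IsWellFormedUtf16` are hypothetical names introduced for illustration; the helper only checks that every surrogate is properly paired.

```csharp
using System;

public static class ReplaceDemo
{
    // Hypothetical helper: true if the string has no unpaired surrogate code units.
    public static bool IsWellFormedUtf16(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]))
            {
                // A high surrogate must be followed by a low surrogate.
                if (i + 1 >= s.Length || !char.IsLowSurrogate(s[i + 1])) return false;
                i++; // skip the low surrogate of the pair
            }
            else if (char.IsLowSurrogate(s[i]))
            {
                return false; // low surrogate without a preceding high surrogate
            }
        }
        return true;
    }

    public static void Main()
    {
        string pair = char.ConvertFromUtf32(0x1F600);
        string text = "a" + pair + "b";

        // Valid arguments: the pair is matched and replaced as a whole.
        string ok = text.Replace(pair, "x");
        Console.WriteLine(ok);                     // axb
        Console.WriteLine(IsWellFormedUtf16(ok));  // True

        // Invalid argument (a lone high surrogate): the pair gets split.
        string bad = text.Replace(pair.Substring(0, 1), "?");
        Console.WriteLine(IsWellFormedUtf16(bad)); // False
    }
}
```

If inputs may come from untrusted sources, a check like `IsWellFormedUtf16` on the arguments before calling Replace is one way to rule out the invalid case.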

Evk
  • Can the same then be said for `remove(string)`? – Ibrennan208 May 04 '18 at 20:06
  • @Ibrennan208 no. Suppose you have a string `s` which contains one surrogate pair, for example: `var s = char.ConvertFromUtf32(0x10FFFC);`. It contains 2 .NET chars, but represents one Unicode "symbol". Now if you do `var m = s.Remove(0, 1);`, it will remove only one char. As a result, `m` will be an invalid UTF-16 string with one dangling surrogate. – Evk May 04 '18 at 20:09
  • Oh my mistake, for some reason I thought there was a remove that took string as a parameter. Your explanation is good thank you. – Ibrennan208 May 04 '18 at 20:12
  • Would you be willing to take a look at this question as well? I'm kind of stuck on all of this surrogate pair logic :( https://stackoverflow.com/questions/50182335/what-is-a-unicode-safe-replica-of-string-indexofstring-input-that-can-handle-s – Ibrennan208 May 04 '18 at 20:19
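
The `Remove(0, 1)` case from the comments above can be reproduced with a short sketch. The class `RemoveDemo` and the helper `RemoveFirstCodePoint` are hypothetical names, shown only as one possible way to remove a whole code point instead of a single char.

```csharp
using System;

public static class RemoveDemo
{
    // Hypothetical helper: removes the first code point, taking both
    // code units when the string starts with a surrogate pair.
    public static string RemoveFirstCodePoint(string s)
    {
        if (s.Length == 0) return s;
        int width = char.IsSurrogatePair(s, 0) ? 2 : 1;
        return s.Remove(0, width);
    }

    public static void Main()
    {
        string s = char.ConvertFromUtf32(0x10FFFC); // one code point, two chars
        string m = s.Remove(0, 1);                  // drops only the high surrogate

        Console.WriteLine(s.Length);                       // 2
        Console.WriteLine(m.Length);                       // 1
        Console.WriteLine(char.IsLowSurrogate(m[0]));      // True: dangling low surrogate
        Console.WriteLine(RemoveFirstCodePoint(s).Length); // 0
    }
}
```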