String.utf16.index(_:offsetBy:) doesn't work as expected on a string that contains emoji

Question

I'm writing code to replace all occurrences of string1 within string2 with string3. For example:

string1: "he"
string2: "hello everybody. I'm here to help you."
string3: "XY"
output:  "XYllo everybody. I'm XYre to XYlp you."
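
For reference, the replacement described above can be sketched with Foundation's `replacingOccurrences(of:with:)`. This is only an illustration of the expected result, not the hand-written method the question is about; the variable names simply mirror the example.

```swift
import Foundation

let string1 = "he"
let string2 = "hello everybody. I'm here to help you."
let string3 = "XY"

// Foundation does the searching and the index bookkeeping internally,
// so no String.Index values are carried from one string to another.
let output = string2.replacingOccurrences(of: string1, with: string3)
print(output) // XYllo everybody. I'm XYre to XYlp you.
```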

Along the way, I noticed that my method doesn't work correctly if string1 or string2 contains emoji. I wrote a small piece of code that reproduces the problem:

let strA = "Hello,My fried"
let strB = "Hi,My fried"
let strC = "🙋🏻‍♂️,My fried"

let offsetBA = "Hi".utf16.count - "Hello".utf16.count
let offsetCA = "🙋🏻‍♂️".utf16.count - "Hello".utf16.count

let idx1 = strA.utf16.index(strA.utf16.startIndex, offsetBy: 11)
print(strA[idx1]) // Output: i

let idx2 = strB.utf16.index(idx1, offsetBy: offsetBA)
print(strB[idx2]) // Output: i

let idx3 = strC.utf16.index(idx1, offsetBy: offsetCA)
print(strC[idx3]) // Output: ,

In summary, there are three strings, and I create an index (idx1) that points to i in the first string. I know that instead of Hello, the second string starts with Hi and the third starts with an emoji. So I want to adjust idx1 and store the results in idx2 and idx3 so that they still point to i in the second and third strings respectively. The calculation works on the all-ASCII second string but not on the third (idx3 points to ,). I looked into the implementation of String, UTF16View, String.Index, and so on, but couldn't find why this happens. I ran this code on Swift 5.1 and Swift 4.
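
One possible way to get the intended result, sketched under the assumption that the goal is simply "same position, shifted by the prefix length difference": turn idx1 into a plain UTF-16 offset first, then build the new index from the target string's own start index. The name `offsetInA` is introduced here for illustration only.

```swift
let strA = "Hello,My fried"
let strC = "🙋🏻‍♂️,My fried"

let idx1 = strA.utf16.index(strA.utf16.startIndex, offsetBy: 11)

// Convert idx1 into an integer offset, which no longer depends on strA's storage.
let offsetInA = strA.utf16.distance(from: strA.utf16.startIndex, to: idx1)

// Difference between the two prefixes, measured in UTF-16 code units.
let offsetCA = "🙋🏻‍♂️".utf16.count - "Hello".utf16.count

// Rebuild the index from strC's own start index instead of reusing idx1.
let idx3 = strC.utf16.index(strC.utf16.startIndex, offsetBy: offsetInA + offsetCA)
print(strC[idx3]) // i
```

The only index ever passed to `strC.utf16` here was produced by `strC.utf16` itself; the value carried over from strA is an `Int`.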

this might help https://stackoverflow.com/a/44533486/2303865 — Leo Dabus, Nov 14 '19 at 17:29
You're implicitly encoding the assumption that there is a one-to-one association between UTF-16 code units and characters as you see them on screen. As you've discovered, that isn't true. `🙋🏻‍♂️` is 5 Unicode code points, spanning 7 UInt16s. https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%99%8B%F0%9F%8F%BB%E2%80%8D%E2%99%82%EF%B8%8F — Alexander, Nov 14 '19 at 19:45
And generally, you must use indices *only* with the collections that they were created with. `idx1` is an index into `strA.utf16` and must not be used with `strB.utf16` or `strC.utf16`. — Martin R, Nov 14 '19 at 20:02
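
The counts mentioned in the comments can be checked directly. This is a small sketch that assumes the emoji is the sequence encoded in the linked URL (U+1F64B, U+1F3FB, U+200D, U+2642, U+FE0F).

```swift
let emoji = "🙋🏻‍♂️"

// Number of Unicode code points (scalars) in the sequence.
print(emoji.unicodeScalars.count) // 5

// Number of UTF-16 code units: the first two scalars need a surrogate pair each.
print(emoji.utf16.count) // 7

// Number of UTF-8 code units (bytes).
print(emoji.utf8.count) // 17
```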
