I was trying to implement the Boyer-Moore algorithm in a Swift Playground, which meant using Swift's String.Index a lot. What started to bother me is that the index values appear to be 4 times bigger than they should be.

For example:

let why = "is s on 4th position not 1st".index(of: "s")

In a Swift Playground, this code produces an index whose _compoundOffset is 4, not 1. I'm sure there is a reason for this, but I couldn't find an explanation anywhere.

This is not a duplicate of the questions that explain how to get the index of a character in Swift. I know how to do that; I used index(of:) only to illustrate the question. What I want to know is why the value for the 2nd character is 4, not 1, when using String.Index.

My guess is that the way indexes are stored is private and I don't need to know the internal implementation, and that it is probably connected with UTF-16 and UTF-32 encoding.

szooky
  • Possible duplicate of [Finding index of character in Swift String](https://stackoverflow.com/questions/24029163/finding-index-of-character-in-swift-string) – Bilal Nov 07 '17 at 09:05
  • Maybe because every character is represented by a [UTF-32](https://en.wikipedia.org/wiki/UTF-32) character in Swift...? – holex Nov 07 '17 at 09:22
  • String.Index should be treated as an opaque type. Because of the way Unicode works, it doesn't necessarily even go up in 4s. – JeremyP Nov 07 '17 at 09:30
  • The value you are looking at is not the index you should compare against. The index you expect, 1, is the `encodedOffset` value of `why`. – The iOSDev Nov 07 '17 at 11:39

1 Answer

First of all, don’t ever assume _compoundOffset to be anything other than an implementation detail. _compoundOffset is an internal property of String.Index that uses bit masking to store two values in a single number:

  • The encodedOffset, which is the index's offset measured in UTF-16 code units (not bytes). This one is public and can be relied on. In your case encodedOffset is 1 because that is the offset of that character, as measured in UTF-16 code units. Note that the encoding of the string in memory doesn't matter! encodedOffset is always expressed in UTF-16 code units.

  • The transcodedOffset, which stores the index's offset inside the current UTF-16 code unit. This is also an internal property that you can't access. The value is usually 0 for most indices, unless you have an index into the string's UTF-8 view that refers to a code unit which doesn't fall on a UTF-16 boundary. In that case, the transcodedOffset will store the offset in bytes from the encodedOffset.
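
To see the UTF-16 offset without touching any internal property, one option is to measure the distance in the string's utf16 view. The following is a minimal sketch, not the standard library's own mechanism: the helper name `utf16Offset(of:in:)` is hypothetical, and it uses `firstIndex(of:)`, the newer spelling of the `index(of:)` call from the question.

```swift
// Hypothetical helper: returns the offset of the first occurrence of
// `character` in `text`, measured in UTF-16 code units.
func utf16Offset(of character: Character, in text: String) -> Int? {
    guard let index = text.firstIndex(of: character) else { return nil }
    // String.Index is shared between the string's views, so the index
    // found in the character view can be used with the utf16 view.
    return text.utf16.distance(from: text.utf16.startIndex, to: index)
}

let asciiOffset = utf16Offset(of: "s", in: "is s on 4th position not 1st")
print(asciiOffset as Any)  // Optional(1)

// The emoji occupies two UTF-16 code units, so "b" sits at offset 3
// even though it is only the third character.
let emojiOffset = utf16Offset(of: "b", in: "a😀b")
print(emojiOffset as Any)  // Optional(3)
```

The second call illustrates why the offsets do not necessarily grow by one per character, which is a further reason to treat String.Index as opaque.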

Now why is _compoundOffset == 4? Because it stores the transcodedOffset in the two least significant bits and the encodedOffset in the 62 most significant bits. So the bit pattern for encodedOffset == 1, transcodedOffset == 0 is 0b100, which is 4.
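
The arithmetic can be reproduced with a small sketch. The functions `pack` and `unpack` below are hypothetical names used only for illustration; they mimic the bit layout described above (two low bits for transcodedOffset, the remaining bits for encodedOffset) and are not the standard library's actual code.

```swift
// Hypothetical illustration of the layout: the 2 least significant bits
// hold transcodedOffset, the upper 62 bits hold encodedOffset.
func pack(encodedOffset: UInt64, transcodedOffset: UInt64) -> UInt64 {
    precondition(transcodedOffset < 4, "transcodedOffset must fit in 2 bits")
    precondition(encodedOffset < (1 << 62), "encodedOffset must fit in 62 bits")
    return (encodedOffset << 2) | transcodedOffset
}

func unpack(_ compound: UInt64) -> (encodedOffset: UInt64, transcodedOffset: UInt64) {
    return (compound >> 2, compound & 0b11)
}

let compound = pack(encodedOffset: 1, transcodedOffset: 0)
print(compound)                         // 4
print(String(compound, radix: 2))       // 100
print(unpack(compound).encodedOffset)   // 1
```

Under this layout, every index with transcodedOffset == 0 shows up as a multiple of 4, which matches the "4 times bigger" values seen in the Playground.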

You can verify all this in the source code for String.Index.

Ole Begemann
  • I assume this representation is because of the new "shared index" between the different views in Swift 4? – Martin R Nov 07 '17 at 16:12
  • @MartinR: Yes, exactly. Proposed in [SE-0180](https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md). – Ole Begemann Nov 07 '17 at 16:15