3

I have some code in Python that builds a dictionary of about 250K strings (words) as keys with each value having an array of strings. The Python version runs in about 0.5 seconds. I needed to port this to Swift but my Swift port runs in 10.1 seconds, 20 times slower.

Here is the Python code:

wordsDictionary = defaultdict(list)
for word in words:
    wordsDictionary[sort_string(word)].append(word)

And here is the Swift code:

var wordsDictionary : Dictionary<String, [String]> = Dictionary()
for word in words {
    let sortedWord : String = String(word.characters.sort())
    if wordsDictionary[sortedWord] == nil {
        wordsDictionary[sortedWord] = []
    }
    wordsDictionary[sortedWord]?.append(word)
}

Is there any way to speed up the Swift version or are Swift dictionaries still that much slower than Python?

Jeff Zacharias
  • 381
  • 3
  • 12
  • 1
    Have you actually profiled this to see what the slow part is? This isn't just basic dictionary operations. There's a lot of things happening here... – nhgrif Jan 30 '16 at 19:08
  • 1
    I'm going to guess string sort is the difference. – Nick Bailey Jan 30 '16 at 19:20
  • Thanks. I had not profiled inside the loop and after doing that I found that the string sorts take about 5s, the array allocations about 3s, and the dictionary appends about 2 seconds, a total of about 10s. So yes, this isn't just a dictionary issue. The Python code does all of this too, including the string sort, in 0.5s, so ether Swift is way slower or I'm doing something wrong. – Jeff Zacharias Jan 30 '16 at 20:31
  • 1
    Is the swift code run in a playground or in an app? – vikingosegundo Jan 30 '16 at 20:42
  • 3
    My random guess: `sort_string` in Python does super-dumb sort over utf-8 byte array, while `word.characters` in Swift breaks the string into graphemes and then `.sort()` creates a whole new collection of a lot of small objects for each grapheme, effectively ruining the performance compared to Python. I suspect you might be gravely misunderstanding what "character" means — Swift is doing it correctly but slow, while Python is doing it very incorrectly but fast. And because in English/ASCII there's no difference, you're having troubles understanding why. – hamstergene Jan 30 '16 at 20:50
  • I appreciate the help. I ran the release build which ran in 4.5s. Duh. The string sort took 3s of the 4.5. After thinking about it I replaced: let sortedWord = String(word.characters.sort()) with: let sortedWordUTF8 = word.utf8.sort() let sortedWord : String = NSString(bytes: sortedWordUTF8, length: sortedWordUTF8.count, encoding: NSUTF8StringEncoding) as! String The sort time went from 3s down to 0.4s, and the total time down to 2.4s. I think I may be able to work with that for this app. – Jeff Zacharias Jan 30 '16 at 22:05
  • Use this for timing: http://stackoverflow.com/questions/24755558/measure-elapsed-time-in-swift/34322903#34322903 – Sverrisson Jan 30 '16 at 22:59
  • @hamstergene FWIW, Python 3 uses fully unicode strings and is no slower in this regard than Python 2 (which uses raw bytestrings) by my timings. In fact, it's likely faster. Seems to me Swift just need faster unicode operations. – Veedrac Jan 31 '16 at 02:15
  • @Veedrac, does it also compare unicode strings on grapheme level or on encoding level? – Jazzmaniac Jan 31 '16 at 02:19
  • @Veedrac Python does not parse graphemes, while Swift does. Try iterating and counting the number of characters in the string "a\u0308o\u0304e\u0302" (äōê) in both of them and see. Python says the length is 6 (number of codepoints), while Swift `.characters.count` says 3 (number of graphemes). – hamstergene Jan 31 '16 at 10:45
  • @Jazzmaniac My bad, didn't realize Swift characters were graphemes. However, modifying the Python code to sort graphemes *still* only roughly doubles the time required. It still seems Swift needs faster unicode operations ;). – Veedrac Jan 31 '16 at 12:55

2 Answers2

1

After changing the string sort and array allocation for the dictionary, here is my final code that executes 1.4s.

var wordsDictionary : Dictionary<String, [String]> = Dictionary()
for word in words {
    let sortedWordUTF8 = word.utf8.sort()
    let sortedWord : String = NSString(bytes: sortedWordUTF8, length: sortedWordUTF8.count, encoding: NSUTF8StringEncoding) as! String
    if wordsDictionary[sortedWord] == nil {
        wordsDictionary[sortedWord] = [word]
    }
    else {
        wordsDictionary[sortedWord]!.append(word)
    }
}
Jeff Zacharias
  • 381
  • 3
  • 12
  • 2
    Your code still contains a redundant check (`?.append` instead of `.append`). Furthermore, sorting the UTF8 bytes will not in general result in a valid UTF8 sequence, so your code is broken for everything except ASCII strings. You **must** sort the characters instead. – Konrad Rudolph Jan 31 '16 at 10:57
  • I edit the code about and removed the redundant check. I meant to say that for the my app, which deals only with ASCII strings, this string sort will work. – Jeff Zacharias Jan 31 '16 at 11:38
  • If your application is only dealing with ASCII strings, `Strings` is the wrong type — use a byte array instead. You’re going to avoid a loot of very annoying bugs. – Konrad Rudolph Jan 31 '16 at 12:08
0

The key differences between your Python code and your Swift code seem to be:

As already mentioned in the comments to your question, Swift's internal string representation is quite different from that of Python and string comparison works on grapheme level rather than encoding level. As you've already eliminated this I won't go into more detail.

Next, in Swift you ask the dictionary to look up your key-string twice per iteration. Even for a hashed lookup this will likely take some time as I doubt the second access will be optimised away. You can avoid this by asking for an DictionaryIndex using the indexForKey method and reuse this. Accessing the dictionary with this index should be faster.

Lastly, you check an optional that does not need checking. The else branch in the Swift code certainly finds a non-nil dictionary entry.

Veedrac
  • 58,273
  • 15
  • 112
  • 169
Jazzmaniac
  • 109
  • 4
  • 2
    That’s wrong. Python `list` is actually a vector, not a linked list. – Konrad Rudolph Jan 31 '16 at 02:33
  • @KonradRudolph, ok, I stand corrected. I find it a little poor to downvote an answer based on just that however. Or do you disagree with the rest too? – Jazzmaniac Jan 31 '16 at 02:42
  • 1
    That’s literally what downvotes are for. Now that you’ve corrected it, the downvote gets removes. – Konrad Rudolph Jan 31 '16 at 02:57
  • FWIW, changing the Python to use two lookups just like the Swift (even if unidiomatically so) doesn't change the time. Changing the Python to sort graphemes (using `regex` to extract graphemes) roughly doubles the time for the loop. That still leaves and order of magnitude between them. – Veedrac Jan 31 '16 at 13:05
  • @Veedrac, the point is not adding an additional lookup to python, but to remove one from Swift. Python might very well cache the lookup internally. Also, just claiming that you get twice the time with correct grapheme handling doesn't float my boat. Where's your code? I'd love to see the regex that gets all graphemes sorted out. – Jazzmaniac Jan 31 '16 at 13:24
  • @Jazzmaniac I was about to give a benchmark, but it turns out I wasn't also normalizing, which I didn't realize Swift was doing. That's fine, it's still not as nearly as slow as Swift, but I can't figure out what Swift considers "normalization" to be. `"e\u{305}\u{31b}" != "e\u{31b}\u{305}"` despite them being `NFD`, `NFKD`, `NFC` and `NFKC` equivalent, for example, but `"e\u{301}" == "\u{e9}"` so *some* normalization is occurring. The documentation is infuriatingly vague. – Veedrac Jan 31 '16 at 15:22
  • @Jazzmaniac Ah, [supposedly it's NFD](http://stackoverflow.com/a/25775112/1763356), except that it seems to be broken. [I've started a question about this.](http://stackoverflow.com/questions/35115717/what-kind-of-normalization-do-swift-string-comparisons-use) – Veedrac Jan 31 '16 at 15:46