
I was trying to find a fast way to sort strings in Python, and locale is a non-concern, i.e. I just want to sort the array lexically according to the underlying bytes. This is a perfect fit for something like radix sort. Here is my MWE:

import numpy as np
import timeit

# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N=int(1e7)
K=100
id3 = randChar("id%010d", N//K, N)   # small groups (char)
timeit.Timer("id3.sort()", "from __main__ import id3").timeit(1)  # 6.8 seconds

As you can see it took 6.8 seconds which is almost 10x slower than R's radix sort below.

N = 1e7
K = 100
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE)
system.time(sort(id3,method="radix"))

I understand that Python's `.sort()` doesn't use radix sort. Is there an implementation somewhere that allows me to sort strings as performantly as R does?

AFAIK both R and Python "intern" strings, so any optimisations in R should also be possible in Python.

The top Google result for "radix sort strings python" is this gist, which produced an error when sorting my test array.

xiaodai
  • Can you use Cython in your code? If yes, look at this [answer](https://stackoverflow.com/a/38280317/3926995) – Chiheb Nexus Dec 31 '17 at 02:13
  • My guess is that you will not beat the built-in `list.sort` or `sorted`. While radix sort is, in theory, `O(N*M)` (size of array, length of strings) and thus generally better than the typical `O(N*logN)` merge/quick/tim sort algorithms, the vast overhead of a pure Python implementation will almost certainly make it much slower than the highly C-optimized built-in sorts. – user2390182 Dec 31 '17 at 02:14
  • Yes, Cython, Pythran as long as it's some sort of Python variant that I can call from Python is fine. Heck, even a way to call a C function is fine! – xiaodai Dec 31 '17 at 02:14
  • 1
    @schwobaseggl As far as I can tell R uses some clever trick with the interned Strings lookup to speed things up using radixsort. E.g. `sort(id3)` is slow in R but `sort(id3, method = "radix")` is blazing fast. – xiaodai Dec 31 '17 at 02:15
  • Normal Python `sort`/`sorted` wouldn't care about locale anyway unless you did something like `sorted(l, key=locale.strxfrm)`. – user2357112 Mar 21 '18 at 18:47

2 Answers


It is true that R interns all strings, meaning it has a "global character cache" which serves as a central dictionary of all strings ever used by your program. This has its advantages: the data takes less memory, and certain algorithms (such as radix sort) can exploit this structure to achieve higher speed. This is particularly true for scenarios such as your example, where the number of unique strings is small relative to the size of the vector. On the other hand, it has its drawbacks too: the global character cache prevents multi-threaded write access to character data.
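That few-uniques property can be imitated even in plain Python with a counting sort over the distinct values. This is only a sketch of the idea (the helper name `sort_few_uniques` is mine, and this is not what R actually does internally):

```python
from collections import Counter

def sort_few_uniques(strings):
    """Counting sort: efficient when distinct strings are few."""
    counts = Counter(strings)        # tally each distinct string
    out = []
    for s in sorted(counts):         # comparison-sort only the unique keys
        out.extend([s] * counts[s])  # expand each key by its count
    return out

data = ["id3", "id1", "id2", "id1", "id3", "id1"]
print(sort_few_uniques(data))  # ['id1', 'id1', 'id1', 'id2', 'id3', 'id3']
```

With N = 10⁷ items but only N/K = 10⁵ distinct values, the expensive comparison sort runs over 10⁵ keys instead of 10⁷ elements.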

In Python, AFAIK, only string literals are interned. For example:

 >>> 'abc' is 'abc'
 True
 >>> x = 'ab'
 >>> (x + 'c') is 'abc'
 False

In practice it means that, unless you've embedded data directly into the text of the program, nothing will be interned.
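That said, you can intern strings yourself with `sys.intern`; a quick demonstration of the difference in CPython:

```python
import sys

a = ''.join(['ab', 'c'])   # built at runtime, so not interned
b = 'abc'                  # literal, interned at compile time in CPython
print(a is b)              # False: two distinct objects
print(sys.intern(a) is b)  # True: intern returns the cached 'abc'
```

Note that identity checks like these are CPython implementation details; interning mainly helps memory use and equality-comparison speed, not ordering.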


Now, for your original question: "what is the fastest way to sort strings in Python?" You can achieve very good speeds, comparable with R, with the Python datatable package. Here's a benchmark that sorts N = 10⁸ strings, randomly selected from a set of 1024 distinct values:

import datatable as dt
import pandas as pd
import random
from time import time
n = 10**8
src = ["%x" % random.getrandbits(10) for _ in range(n)]
f0 = dt.Frame(src)
p0 = pd.DataFrame(src)
f0.to_csv("test1e8.csv")

t0 = time(); f1 = f0.sort(0); print("datatable: %.3fs" % (time()-t0))
t0 = time(); src.sort(); print("list.sort: %.3fs" % (time()-t0))
t0 = time(); p1 = p0.sort_values(0); print("pandas:    %.3fs" % (time()-t0))

Which produces:

datatable: 1.465s / 1.462s / 1.460s (multiple runs)
list.sort: 44.352s
pandas:    395.083s

The same dataset in R (v3.4.2):

> require(data.table)
> DT = fread("test1e8.csv")
> system.time(sort(DT$C1, method="radix"))
   user  system elapsed 
  6.238   0.585   6.832 
> system.time(DT[order(C1)])
   user  system elapsed 
  4.275   0.457   4.738 
> system.time(setkey(DT, C1))  # sort in-place
   user  system elapsed 
  3.020   0.577   3.600 
Pasha

Jeremy Mets posted in the comments of this blog post that NumPy can sort strings fairly quickly if the list is first converted to an array with `np.array`. This does indeed improve performance; however, it is still slower than Julia's implementation.

import numpy as np
import timeit

# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N=int(1e7)
K=100
id3 = np.array(randChar("id%010d", N//K, N))   # small groups (char)
timeit.Timer("id3.sort()", "from __main__ import id3").timeit(1)  # faster than the 6.8 s list sort above
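One further tweak worth trying (my own suggestion, not from the blog post): store the strings as fixed-width byte strings (dtype `'S'`), so NumPy compares raw bytes rather than Unicode code points. That matches the byte-wise ordering asked for in the question and is typically faster than sorting a Unicode or object array:

```python
import numpy as np

words = ["id0000000003", "id0000000001", "id0000000002"]
arr = np.array(words, dtype="S12")  # fixed-width byte strings (12 bytes each)
arr.sort()                          # sorts by raw byte values
print(arr[0])                       # b'id0000000001'
```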
xiaodai