Adapted from the answer to this question: Does Python have a built in function for string natural sort?
import re

def nat_cmp(a, b):
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return cmp(alphanum_key(a), alphanum_key(b))

print nat_cmp('foo10z', 'foo100z')
print cmp('foo10z', 'foo100z')  # <- notice the builtin yields a different result
Outputs:
-1
1
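(Side note, not from the original answer: the snippet above is Python 2 — cmp() and the print statement are gone in Python 3. The same key function still works there; you just pass it straight to sorted() instead of writing a comparator. A minimal Python 3 sketch:)

```python
import re

def alphanum_key(key):
    """Natural-sort key: split into lowercase text and integer chunks."""
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    return [convert(c) for c in re.split('([0-9]+)', key)]

# Python 3 has no cmp(); pass the key function to sorted() instead.
print(sorted(['foo100z', 'foo10z', 'foo2z'], key=alphanum_key))
# → ['foo2z', 'foo10z', 'foo100z']
```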
Update
Timed (with the example input) with ipython:
In [1]: %timeit nat_cmp('foo10z', 'foo100z')
100000 loops, best of 3: 11.6 us per loop
Update 2
Speaking of performance... I'm not sure you appreciate how fast the re library actually is compared to pure-Python code. To demonstrate, I've taken the key function (the portion that uses re), rewritten it several times in pure Python, and compared their speeds against the much simpler use of re.split.
import re
from itertools import groupby

def regex_key(key):
    """Traditional, regular-expression-based nat-sort key."""
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    return [convert(c) for c in re.split('([0-9]+)', key)]

def fast_key(value):
    """Attempt #1 to go faster than the 'slow' 're' library."""
    result = []
    for is_int, chunk in groupby(value.lower(), str.isdigit):
        if is_int:
            result.append(int(''.join(chunk)))
        else:
            result.append(tuple(chunk))
    return result

def faster_key(value):
    """Attempt #2. 'Low-level' python."""
    value = value.lower()  # lowercase once, so text chunks match regex_key's
    start_idx = 0
    is_num = None
    result = []
    for idx, c in enumerate(value):
        now_is_num = c.isdigit()
        if is_num is not None and now_is_num != is_num:
            buf = value[start_idx:idx]
            result.append(int(buf) if is_num else buf)
            start_idx = idx
        is_num = now_is_num
    buf = value[start_idx:]
    result.append(int(buf) if is_num else buf)
    return result
Next, I run these against a simple benchmark:
from datetime import datetime

def benchmark(fn):
    print "Benching %s (run 1000 times)" % fn.__name__
    start = datetime.now()
    for x in xrange(1000):
        # run key function on something approx 100 chars long
        fn('asdf1234sdfg234jhd88123j2134 - 123d34123djfsk'*2)
    print "\t%s" % (datetime.now() - start)

benchmark(regex_key)
benchmark(fast_key)
benchmark(faster_key)
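(If you want to re-run this today: the benchmark above is Python 2 — print statements and xrange. A rough Python 3 equivalent, using the timeit module instead of hand-rolled datetime arithmetic, might look like this; shown only for regex_key, and the absolute numbers will of course differ from mine:)

```python
import re
import timeit

def regex_key(key):
    """Same regex-based nat-sort key as above."""
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    return [convert(c) for c in re.split('([0-9]+)', key)]

# same ~100-char sample string as the original benchmark
sample = 'asdf1234sdfg234jhd88123j2134 - 123d34123djfsk' * 2

# timeit handles loop overhead and clock selection for us
elapsed = timeit.timeit(lambda: regex_key(sample), number=1000)
print('regex_key: %.6fs for 1000 runs' % elapsed)
```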
Here are the results:
Benching regex_key (run 1000 times)
0:00:00.025908
Benching fast_key (run 1000 times)
0:00:00.065567
Benching faster_key (run 1000 times)
0:00:00.042654
Now, I'm sure there are things I could do to make my key-func implementations faster, but unless I'm missing something huge, it's going to be difficult to beat the re.split code in pure Python.
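(One thing worth verifying, which the benchmark doesn't cover: the key lists differ in shape — fast_key emits tuples of characters where regex_key emits strings, plus empty edge strings — but the variants should still induce the same sort order. A quick Python 3 check, my addition:)

```python
import re
from itertools import groupby

def regex_key(key):
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    return [convert(c) for c in re.split('([0-9]+)', key)]

def fast_key(value):
    # groupby splits the string into runs of digits vs non-digits
    result = []
    for is_int, chunk in groupby(value.lower(), str.isdigit):
        result.append(int(''.join(chunk)) if is_int else tuple(chunk))
    return result

names = ['foo100z', 'bar10', 'foo2z', 'bar1', 'foo10z']
print(sorted(names, key=regex_key) == sorted(names, key=fast_key))
# → True
```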