0

Does Python have a fast function for comparing two strings in natural order? Not sorting necessarily, just a comparison function that returns -1, 0 or 1 depending on whether the first string comes before, compares equal to, or comes after the second in natural order.

EDIT: the proposed function is correct but it is far too slow. How can this be done quickly in Python?

NOTE: This is not a duplicate of the post many people suggest, since those other threads do not address the efficiency issue. The current solutions work and are correct, but they make a regular expression call for each line, which is prohibitively expensive. I would like a solution that is efficient enough to be used for millions of comparisons.

    Define "natural order". – dan04 Dec 06 '11 at 23:00
  • possible duplicate of [Does Python have a built in function for string natural sort?](http://stackoverflow.com/questions/4836710/does-python-have-a-built-in-function-for-string-natural-sort) – mac Dec 06 '11 at 23:19
  • It's not a duplicate since these other threads do not address the efficiency issue. The current solutions make a regular expression call for each line, which is prohibitively expensive –  Dec 07 '11 at 01:12
  • @user248237 - (1) An edit should improve the clarity of a question, not change its nature. You started by asking about the function **then** changed it into a question about speed. (2) Regexes are really fast [relative to the amount of work they do]. (3) Comparison functions like `cmp` are fast because they compare at the bit level. Anything that requires "human logic" is going to be **by far** slower. – mac Dec 07 '11 at 07:40

3 Answers

5

cmp is the built-in function to do just that.

>>> a = 'hello'
>>> b = 'world'
>>> cmp(a, b)
-1

EDIT: by "natural sort" do you mean sorting numbers as humans would? If so, then this is a possible recipe.
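
Note that `cmp` only exists in Python 2; on Python 3 a minimal stand-in is something like:

def cmp(a, b):
    # Mimics Python 2's built-in cmp(): returns -1, 0 or 1.
    return (a > b) - (a < b)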

mac
  • @user248237 - No. See my edits if that is the behaviour you are looking for (I did not paste the code directly as there is a long thread of interesting observations on the ActiveState website). – mac Dec 06 '11 at 23:14
5

Adapted from the answer to this question: Does Python have a built in function for string natural sort?

import re

def nat_cmp(a, b):
    # Split each string into text and number chunks, converting runs of digits
    # to ints, so that e.g. 'foo10z' becomes ['foo', 10, 'z'].
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]

    return cmp(alphanum_key(a), alphanum_key(b))

print nat_cmp('foo10z', 'foo100z')
print cmp('foo10z', 'foo100z')  # <- notice the builtin yields a different result

Outputs:

-1
1
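
If the comparisons are ultimately for sorting, most of that cost can be avoided by computing the key once per string and sorting on the keys, rather than re-running the regular expression inside every comparison. A rough sketch (the example data is illustrative):

import re

def alphanum_key(key):
    # Same conversion nat_cmp uses internally, but applied once per string.
    return [int(c) if c.isdigit() else c.lower() for c in re.split('([0-9]+)', key)]

strings = ['foo100z', 'foo10z', 'foo1z']
sorted_strings = sorted(strings, key=alphanum_key)  # each string is converted exactly once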

Update

Timed (with the example input) with ipython:

In [1]: %timeit nat_cmp('foo10z', 'foo100z')
100000 loops, best of 3: 11.6 us per loop

Update 2

Speaking of performance... I'm not sure you realize how fast the re library actually is compared to pure-Python code. To demonstrate, I've taken the key function (the portion that uses re), rewritten it several times in pure Python, and compared the speed of each version against the much simpler use of re.split.

import re
from itertools import groupby

def regex_key(key):
    """Traditional, regular-expression-based nat-sort key."""
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    return [convert(c) for c in re.split('([0-9]+)', key)]

def fast_key(value):
    """Attempt #1 to go faster than 'slow' 're' library."""
    result = []
    for is_int, chunk in groupby(value.lower(), str.isdigit):
        if is_int:
            result.append(int(''.join(chunk)))
        else:
            result.append(tuple(chunk))
    return result

def faster_key(value):
    """Attempt #2.  'Low-level' python."""
    value = value.lower()
    start_idx = 0
    is_num = None
    result = []
    for idx, c in enumerate(value):
        now_is_num = c.isdigit()
        if is_num is not None and now_is_num != is_num:
            # Chunk boundary: flush the run that just ended.
            buf = value[start_idx:idx]
            result.append(int(buf) if is_num else buf)
            start_idx = idx
        is_num = now_is_num
    # Flush the final run.
    buf = value[start_idx:]
    result.append(int(buf) if is_num else buf)
    return result

Next, I run these against a simple benchmark:

from datetime import datetime

def benchmark(fn):
    print "Benching %s (run 1000 times)" % fn.__name__

    start = datetime.now()
    for x in xrange(1000):
        # run key function on something approx 100 chars long
        fn('asdf1234sdfg234jhd88123j2134 - 123d34123djfsk'*2)
    print "\t%s" % (datetime.now() - start)

benchmark(regex_key)
benchmark(fast_key)
benchmark(faster_key)

Here are the results:

Benching regex_key (run 1000 times)
    0:00:00.025908
Benching fast_key (run 1000 times)
    0:00:00.065567
Benching faster_key (run 1000 times)
    0:00:00.042654

Now, I'm sure there are things I could do to make my key-function implementations faster, but unless I'm missing something huge, it's going to be difficult to match the speed of the re.split code in pure Python.
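
One further micro-optimization worth trying (not benchmarked above; the helper names are just illustrative) is compiling the pattern once and reusing its bound split method, which skips the per-call pattern lookup that re.split performs:

import re

_split_numbers = re.compile('([0-9]+)').split  # compiled once, reused for every key

def regex_key_precompiled(key):
    """Same key as regex_key, minus the repeated pattern lookup."""
    return [int(c) if c.isdigit() else c.lower() for c in _split_numbers(key)]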

Adam Wagner
  • You can no longer compare `int` to `str`. Also, `cmp` was removed. – dan04 Dec 06 '11 at 23:17
  • Actually this function is incredibly slow -- I have to make millions of comparisons with it. Is there a way to speed it up? –  Dec 07 '11 at 00:56
  • Can you give me an example of what you are comparing, and what you consider slow? – Adam Wagner Dec 07 '11 at 01:08
  • @AdamWagner I am comparing on the order of 50-100 million strings to each other. Each string is roughly 100 characters in length –  Dec 07 '11 at 01:13
  • @AdamWagner I'd also like to avoid the call to "re" since it will be done millions of times and that's very inefficient –  Dec 07 '11 at 01:13
  • What speeds are you seeing? I've tested this against strings roughly 100 chars long, and I'm getting avg runtimes of 65us. I understand you are working with mass amounts of data, but if your performance requirements are that great, perhaps you should look into something other than Python. That being said, this should (if I've done my math right) yield over 10k comparisons per second. – Adam Wagner Dec 07 '11 at 01:22
  • @AdamWagner I think Python can do this if it were simple operations like string split, etc. but evaluating a regular expression each time just seems very clunky/slow. The re library is slow in general. –  Dec 07 '11 at 02:48
  • What kind of speeds are you getting/expecting? – Adam Wagner Dec 07 '11 at 02:57
2

This will allow you to sort a list of strings naturally:

import re

unsorted_list = ["a1", "a2", "a11", "b1", "b2", "b11"]

def natural_key(s):
    # Split into text and number chunks; convert the number chunks to ints.
    return [int(c) if c.isdigit() else c for c in re.split(r'(\d+)', s)]

sorted_list = sorted(unsorted_list, key=natural_key)

print(sorted_list)

This will return -1, 0 or 1, depending on whether x sorts before, equal to, or after y in natural order:

def natural_cmp(x, y):
    x = [int(c) if c.isdigit() else c for c in re.split(r'(\d+)', x)]
    y = [int(c) if c.isdigit() else c for c in re.split(r'(\d+)', y)]
    if x == y:
        return 0
    elif x > y:
        return 1
    else:
        return -1

This works in Python 2.x and 3.x (note that in 3.x the comparison can raise a TypeError if the two keys put an int and a str at the same position, e.g. 'a1' vs. 'ab').
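
If you need to use a comparison function like this with sorted() on Python 3, where the cmp argument was removed, it can be wrapped with functools.cmp_to_key, roughly like this:

from functools import cmp_to_key

# Turns the two-argument comparison into a key function for sorted().
sorted_list = sorted(unsorted_list, key=cmp_to_key(natural_cmp))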

Serdalis