Quickly alphabetize a large file via python

Question

#!/usr/bin/python

import random
import string

appendToFile = open("appendedFile", "a" )

# Generator

for i in range(1, 100000):

    chars = "".join( [random.choice(string.letters) for i in xrange(15)] )
    chars2 = "".join( [random.choice(string.letters) for i in xrange(15)] )

    appendToFile.write(chars + ":" + chars2 + "\n")

appendToFile.close()

Code modified from this question.

The above code generates 100,000 lines of random text in the format of STRING:STRING. Resultant text file is 3.1 MB.

How would one rapidly alphabetise the file, using the first STRING in STRING:STRING? Case is irrelevant.

Bubble sort is very slow, no?

Are we to take advantage of the fact that this can fit in RAM on modern machines, or do you need a routine like the real Unix sort(1) command that can cache intermediate results out to disk and so work on files of unlimited size? — Brandon Rhodes, Dec 08 '09 at 23:13

steveha · Answer 1 · 2009-12-08T23:24:05.493

8

The obvious first approach is simply to use the built-in sort feature in Python. Is this not what you had in mind? If not, why? With only 100,000 lines of random text, the built-in sort would be very fast.

lst = open("appendedFile", "rt").readlines()
lst.sort(key=str.lower)

Done. We could do it as a one-liner if you really wanted to:

lst = sorted(open("appendedFile", "rt").readlines(), key=str.lower)

EDIT: I just checked, and strings.letters includes both upper-case and lower-case letters. So, above code is modified to be case-insensitive.

EDIT: more on sorting in Python: http://wiki.python.org/moin/HowTo/Sorting

edited Dec 08 '09 at 23:24

answered Dec 08 '09 at 23:15

steveha

74,789
21
92
117

If not, why? - New to python XD. I created that code so that you'd have a file to work from. – torger Dec 08 '09 at 23:19
FYI, the built-in Python sort for lists is a merge sort. Complexity O(n*log n). – BrainCore Dec 08 '09 at 23:25

score 5 · Accepted Answer · edited Dec 11 '09 at 09:42

5

This is very fast (under 1 second on my computer). It uses a case-insensitive sort, which is assume what you mean by "case is irrelevant"?

#!/usr/bin/python

appendToFile = open("appendedFile", "r")
sortToFile = open("sortedFile", "w")

for line in sorted(appendToFile, key = str.lower):
    sortToFile.write(line)

edited Dec 11 '09 at 09:42

Denis Otkidach

32,032
8
79
100

answered Dec 08 '09 at 23:20

Mark Byers

811,555
193
1,581
1,452

2

Why not just: `... sorted(..., key=str.lower)` ? – Don O'Donnell Dec 09 '09 at 05:11

ChristopheD · Answer 3 · 2009-12-08T23:26:54.777

1

Try this (case insensitive):

l=file(appendedFile).readlines()
l.sort(key=lambda x:x.lower())

For these kinds of sizes optimalisation is not really necessary (timings on my slow machine ;-):

christophe@orion:~$ time python -c "l=file('appendedFile').readlines();l.sort(key=lambda x:x.lower())"

real    0m0.615s
user    0m0.576s
sys 0m0.024s

edited Dec 08 '09 at 23:26

answered Dec 08 '09 at 23:20

ChristopheD

112,638
29
165
179

Thanks for the timing mechanism - didn't know it existed. – torger Dec 09 '09 at 00:16
The "time" command is available under Linux. It is probably available under Mac OS X. You can also get it for Windows but Microsoft didn't build it in. The easiest way to get it for Windows is to install Cygwin. A purely Python-based approach, which is thus portable, is to use the "timeit" module: http://docs.python.org/library/timeit.html – steveha Dec 09 '09 at 19:52
@CrhistopheD, you don't need the `lambda` function; you can just use: `key=str.lower` – steveha Dec 09 '09 at 19:53

Quickly alphabetize a large file via python

3 Answers3