5
#!/usr/bin/python

import random
import string

appendToFile = open("appendedFile", "a" )

# Generator

for i in range(1, 100000):

    chars = "".join( [random.choice(string.letters) for i in xrange(15)] )
    chars2 = "".join( [random.choice(string.letters) for i in xrange(15)] )

    appendToFile.write(chars + ":" + chars2 + "\n")

appendToFile.close()

Code modified from this question.

The above code generates 100,000 lines of random text in the format of STRING:STRING. Resultant text file is 3.1 MB.

How would one rapidly alphabetise the file, using the first STRING in STRING:STRING? Case is irrelevant.

Bubble sort is very slow, no?

Community
  • 1
  • 1
torger
  • 2,308
  • 4
  • 28
  • 35
  • 1
    Are we to take advantage of the fact that this can fit in RAM on modern machines, or do you need a routine like the real Unix sort(1) command that can cache intermediate results out to disk and so work on files of unlimited size? – Brandon Rhodes Dec 08 '09 at 23:13
  • Advantage of excess RAM. – torger Dec 08 '09 at 23:18

3 Answers3

8

The obvious first approach is simply to use the built-in sort feature in Python. Is this not what you had in mind? If not, why? With only 100,000 lines of random text, the built-in sort would be very fast.

lst = open("appendedFile", "rt").readlines()
lst.sort(key=str.lower)

Done. We could do it as a one-liner if you really wanted to:

lst = sorted(open("appendedFile", "rt").readlines(), key=str.lower)

EDIT: I just checked, and strings.letters includes both upper-case and lower-case letters. So, above code is modified to be case-insensitive.

EDIT: more on sorting in Python: http://wiki.python.org/moin/HowTo/Sorting

steveha
  • 74,789
  • 21
  • 92
  • 117
5

This is very fast (under 1 second on my computer). It uses a case-insensitive sort, which is assume what you mean by "case is irrelevant"?

#!/usr/bin/python

appendToFile = open("appendedFile", "r")
sortToFile = open("sortedFile", "w")

for line in sorted(appendToFile, key = str.lower):
    sortToFile.write(line)
Denis Otkidach
  • 32,032
  • 8
  • 79
  • 100
Mark Byers
  • 811,555
  • 193
  • 1,581
  • 1,452
1

Try this (case insensitive):

l=file(appendedFile).readlines()
l.sort(key=lambda x:x.lower())

For these kinds of sizes optimalisation is not really necessary (timings on my slow machine ;-):

christophe@orion:~$ time python -c "l=file('appendedFile').readlines();l.sort(key=lambda x:x.lower())"

real    0m0.615s
user    0m0.576s
sys 0m0.024s
ChristopheD
  • 112,638
  • 29
  • 165
  • 179
  • Thanks for the timing mechanism - didn't know it existed. – torger Dec 09 '09 at 00:16
  • The "time" command is available under Linux. It is probably available under Mac OS X. You can also get it for Windows but Microsoft didn't build it in. The easiest way to get it for Windows is to install Cygwin. A purely Python-based approach, which is thus portable, is to use the "timeit" module: http://docs.python.org/library/timeit.html – steveha Dec 09 '09 at 19:52
  • @CrhistopheD, you don't need the `lambda` function; you can just use: `key=str.lower` – steveha Dec 09 '09 at 19:53