I have a text file that contains over 10 million lines, like these:

37024469;196672001;255.0000000000
37024469;196665001;396.0000000000
37024469;196664001;396.0000000000
37024469;196399002;85.0000000000
37024469;160507001;264.0000000000
37024469;160506001;264.0000000000

As you can see, the delimiter is ";". I would like to sort this text file by the second field using Python. I couldn't use the split function, because it causes a MemoryError. How can I manage it?

  • Can you show the code you've tried? – mgilson Jan 22 '13 at 18:06
  • So you say you can't load the file in memory? – kaspersky Jan 22 '13 at 18:10
  • That's only about 380MB of data. Are you running this on a phone? – jordanm Jan 22 '13 at 18:10
  • @jordanm: double that when running on a 64-bit OS. 800MB for a 64-bit Python 3.3 build and the OP is letting Python decode the file data instead of treating it as binary. – Martijn Pieters Jan 22 '13 at 18:16
  • If you're using read() and reading the whole file into memory it will be an array of strings that _might_ fill up your memory. Try reading the file line by line, storing a three element tuple of floats/ints instead. It should be much more memory efficient and you might be able to squeeze the whole thing into your paltry memory. – André Laszlo Jan 22 '13 at 18:16
  • @AndréLaszlo: The string (even as Unicode data) takes fewer bytes, actually. Python int and float types need a little more once converted. – Martijn Pieters Jan 22 '13 at 18:19
  • @Martijn can you elaborate? sys.getsizeof(37024469)->24 sys.getsizeof("37024469")->48. Even empty strings appear to take 40 bytes, while integers up to 10^18 take 24 bytes. Obviously these things can be implementation specific. – agentp Jan 22 '13 at 18:57
  • @MartijnPieters: You're completely right, of course. My bad. My suggestion eats almost three times the memory :) – André Laszlo Jan 22 '13 at 19:12
  • @george: You should compare one string with 2 integers and a float value instead. – Martijn Pieters Jan 22 '13 at 20:18
  • fair enough, but if you only store whole lines as strings then your sorting algorithm will need to parse the string each time it needs the value. Perhaps storing tuples of the form (int(field2) , whole-string) is optimal.. – agentp Jan 22 '13 at 20:37
  • @george: you only have to parse each line once to sort it, so you may as well do it *during* sorting: `lines.sort(key=lambda l: int(l.split(';', 2)[1]))`. – Martijn Pieters Jan 22 '13 at 20:39

3 Answers


Don't sort 10 million lines in memory. Split this up in batches instead:

  • Run 100 sorts of 100k lines each (using the file as an iterator, combined with islice() or similar to pick a batch). Write each sorted batch out to a separate file elsewhere.

  • Merge the sorted files. Here is a merge generator to which you can pass 100 open files; it'll yield lines in sorted order. Write them to a new file line by line:

    import operator
    
    def mergeiter(*iterables, **kwargs):
        """Given a set of sorted iterables, yield the next value in merged order
    
        Takes an optional `key` callable to compare values by.
        """
        iterables = [iter(it) for it in iterables]
        iterables = {i: [next(it), i, it] for i, it in enumerate(iterables)}
        if 'key' not in kwargs:
            key = operator.itemgetter(0)
        else:
            key = lambda item, key=kwargs['key']: key(item[0])
    
        while True:
            value, i, it = min(iterables.values(), key=key)
            yield value
            try:
                iterables[i][0] = next(it)
            except StopIteration:
                # this iterable is exhausted; drop it from the pool
                del iterables[i]
                if not iterables:
                    return  # all inputs exhausted; end the generator
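
The chunk-sorting pass from the first bullet isn't spelled out above. Here is a self-contained sketch of the whole two-pass approach (the batch size and file handling are illustrative); it decorates each line with its key and merges with the standard library's heapq.merge, which does the same job as mergeiter():

```python
from heapq import merge
from itertools import islice
from tempfile import TemporaryFile

def sort_key(line):
    # numeric value of the second ';'-separated field
    return int(line.split(';', 2)[1])

def external_sort(infile, outfile, batch_size=100000):
    # pass 1: sort batch_size lines at a time, spilling each
    # sorted batch to its own temporary file
    chunks = []
    while True:
        batch = sorted(islice(infile, batch_size), key=sort_key)
        if not batch:
            break
        tmp = TemporaryFile('w+')
        tmp.writelines(batch)
        tmp.seek(0)  # rewind so the merge pass can read it back
        chunks.append(tmp)
    # pass 2: decorate each line with its key so heapq.merge
    # compares numerically, then write lines out in merged order
    decorated = [((sort_key(line), line) for line in tmp) for tmp in chunks]
    outfile.writelines(line for _, line in merge(*decorated))
    for tmp in chunks:
        tmp.close()  # temporary files are deleted on close
```

Usage would be something like `with open('input.txt') as src, open('output.txt', 'w') as dst: external_sort(src, dst)`.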
    
Martijn Pieters
  • This technique is also known as [external sorting](http://en.wikipedia.org/wiki/External_sorting). –  Jan 22 '13 at 18:14

Based on Sorting a million 32-bit integers in 2MB of RAM using Python:

import sys
from functools import partial
from heapq import merge
from tempfile import TemporaryFile

# define sorting criteria
def second_column(line, default=float("inf")):
    try:
        return int(line.split(";", 2)[1]) # use int() for numeric sort
    except (IndexError, ValueError):
        return default # a key for non-integer or non-existent 2nd column

# sort lines in small batches, write intermediate results to temporary files
sorted_files = []
nbytes = 1 << 20 # load around nbytes bytes at a time
for lines in iter(partial(sys.stdin.readlines, nbytes), []):
    lines.sort(key=second_column) # sort current batch
    f = TemporaryFile("w+")
    f.writelines(lines)
    f.seek(0) # rewind
    sorted_files.append(f)

# merge & write the result
sys.stdout.writelines(merge(*sorted_files, key=second_column))

# clean up
for f in sorted_files:
    f.close() # temporary file is deleted when it closes

heapq.merge() has had a key parameter since Python 3.5. On older Python versions you could use mergeiter() from Martijn Pieters' answer instead, or apply a Schwartzian transform:

iters = [((second_column(line), line) for line in file)
         for file in sorted_files] # note: this makes the sort unstable
sorted_lines = (line for _, line in merge(*iters))
sys.stdout.writelines(sorted_lines)

Usage:

$ python sort-k2-n.py < input.txt > output.txt
jfs

You can do it with an os.system() call to the Unix sort command. Note that sort splits on whitespace by default, so you need -t to set the ";" delimiter and -n for a numeric comparison:

sort -t ';' -k2,2 -n yourFile.txt
peixe