
I am looking for an efficient way to load a huge data file.

The file has the following format:

1\tword1\tdata
2\tword2\tdata
3\tword3\tdata
\r\n
1\tword4\tdata
2\tword2\tdata
\r\n

where \r\n marks the end of a sentence, and each sentence consists of the words on the preceding lines.

I want to load the file while preserving its structure, i.e. I want to be able to refer to a sentence and to a word within that sentence. As a result I want to get something like this:

data = [sentence1, sentence2,... ]

where sentence = [word1,word2,...]
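
For the sample file above, that would be something like:

data = [['word1', 'word2', 'word3'],
        ['word4', 'word2']]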

Loading the file line by line takes a lot of time; loading the file in batches is much more efficient. However, I don't know how to parse the batches and divide the data into sentences.

Currently I use the following code:

from itertools import islice

def loadf(filename):
    n = 100000
    data = []
    with open(filename) as f:
        while True:
            # read the next batch of n lines
            next_n_lines = list(islice(f, n))
            if not next_n_lines:
                break
            data.extend([line.strip().split('\t') for line in next_n_lines])
    return data

With this code I don't know how to divide the data into sentences. In addition, I suspect that extend does not actually extend the current list but creates a new one and reassigns it, because it's extremely slow.

I would appreciate any help.

user16168
    "Loading file line by line take a lot of time, loading file by batches much more efficient" - you sure about that? Did you actually time it? Python reads the file in chunks to feed the line iterator, so you don't have to handle that yourself. – user2357112 Dec 18 '13 at 07:12
  • What is the file size approximately? – Tim Zimmermann Dec 18 '13 at 07:13
  • @TimZimmermann, ~700Mb – user16168 Dec 18 '13 at 07:16
  • @user2357112, let me rephrase it: running `extend` on a batch of lines was faster than calling `append` for each line. – user16168 Dec 18 '13 at 07:17
  • @user16168: That probably depends on exactly how your `append`-based code was structured. – user2357112 Dec 18 '13 at 07:20
  • You may try to use `for line in sys.stdin`. I recently used it to deal with a 1000MB-sized file and it took just a few seconds to read it (on an SSD). On Linux/Mac OS you would use `python PYTHONFILE.py < DATAFILE`. However, I'm not sure whether it's the fastest way of doing it. – Tim Zimmermann Dec 18 '13 at 07:20
  • @TimZimmermann: You'd probably use input redirection with `<` rather than `cat`. – user2357112 Dec 18 '13 at 07:21
  • Try this http://stackoverflow.com/questions/14289421/how-to-use-mmap-in-python-when-the-whole-file-is-too-big – Kracekumar Dec 18 '13 at 07:23
  • related: [How to read tokens without reading whole line or file](http://stackoverflow.com/q/20019503/4279) – jfs Dec 18 '13 at 07:27
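
For reference, a minimal sketch of the `sys.stdin` approach suggested in the comments, assuming the same tab-separated format with blank lines as sentence separators (run as `python PYTHONFILE.py < DATAFILE`):

import sys

# Read stdin line by line and split into sentences at blank lines.
data = []
sentence = []
for line in sys.stdin:
    line = line.strip()
    if not line:                # blank line -> end of the current sentence
        if sentence:
            data.append(sentence)
            sentence = []
    else:
        sentence.append(line.split('\t')[1])   # keep only the word column
if sentence:                    # flush the last sentence if there is no trailing blank line
    data.append(sentence)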

1 Answer


How about:

import csv
from itertools import groupby

with open(yourfile) as fin:
    tabin = csv.reader(fin, delimiter='\t')
    sentences = [[el[1] for el in g] for k, g in groupby(tabin, bool) if k]
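
If I understand the approach correctly, `csv.reader` yields an empty row for each blank line, so `groupby(tabin, bool)` groups consecutive non-blank rows together and `if k` drops the blank-line groups; `el[1]` keeps the word column. For the sample file above this should give something like:

print(sentences)
# [['word1', 'word2', 'word3'], ['word4', 'word2']]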
Jon Clements