
I am looking for an efficient way to load a huge data file.

The file has the following format:

1\tword1\tdata
2\tword2\tdata
3\tword3\tdata
\r\n
1\tword4\tdata
2\tword2\tdata
\r\n

where \r\n marks the end of a sentence, and each sentence consists of the words on the preceding lines.

I want to load the file while preserving its structure, i.e. I want to be able to refer to a sentence and to a word within that sentence. As a result I want to get something like this:

data = [sentence1, sentence2,... ]

where sentence = [word1,word2,...]
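
For the sample file above, that would be something like:

data = [['word1', 'word2', 'word3'],
        ['word4', 'word2']]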

Loading the file line by line takes a lot of time; loading the file in batches is much more efficient. However, I don't know how to parse the batches and divide the data into sentences.

Currently I use the following code:

from itertools import islice

def loadf(filename):
    n = 100000
    data = []
    with open(filename) as f:
        while True:
            # read the next batch of n lines
            next_n_lines = list(islice(f, n))
            if not next_n_lines:
                break
            data.extend([line.strip().split('\t') for line in next_n_lines])
    return data

With this code I don't know how to divide the data into sentences. In addition, I suspect that extend does not actually extend the current list but creates a new one and reassigns it, because it's extremely slow.

I would appreciate any help.

user16168
    "Loading file line by line take a lot of time, loading file by batches much more efficient" - you sure about that? Did you actually time it? Python reads the file in chunks to feed the line iterator, so you don't have to handle that yourself. – user2357112 Dec 18 '13 at 07:12
  • What is the file size approximately? – Tim Zimmermann Dec 18 '13 at 07:13
  • @TimZimmermann, ~700Mb – user16168 Dec 18 '13 at 07:16
  • @user2357112, let me rephrase it: running `extend` on a batch of lines was faster than calling `append` for each line. – user16168 Dec 18 '13 at 07:17
  • @user16168: That probably depends on exactly how your `append`-based code was structured. – user2357112 Dec 18 '13 at 07:20
  • You may try to use `for line in sys.stdin`. I recently used it to deal with a 1000MB-sized file and it took just a few seconds to read it (on an SSD). On Linux/Mac OS you would use `python PYTHONFILE.py < DATAFILE`. However, I'm not sure whether it's the fastest way of doing it. – Tim Zimmermann Dec 18 '13 at 07:20
  • @TimZimmermann: You'd probably use input redirection with `<` rather than `cat`. – user2357112 Dec 18 '13 at 07:21
  • Try this http://stackoverflow.com/questions/14289421/how-to-use-mmap-in-python-when-the-whole-file-is-too-big – Kracekumar Dec 18 '13 at 07:23
  • related: [How to read tokens without reading whole line or file](http://stackoverflow.com/q/20019503/4279) – jfs Dec 18 '13 at 07:27
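
For reference, a minimal sketch of the `sys.stdin` approach suggested in the comments, assuming the same tab-separated format with blank lines as sentence separators (run as `python PYTHONFILE.py < DATAFILE`):

import sys

# Read stdin line by line and split into sentences at blank lines.
data = []
sentence = []
for line in sys.stdin:
    line = line.strip()
    if not line:                # blank line -> end of the current sentence
        if sentence:
            data.append(sentence)
            sentence = []
    else:
        sentence.append(line.split('\t')[1])   # keep only the word column
if sentence:                    # flush the last sentence if there is no trailing blank line
    data.append(sentence)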

1 Answer


How about:

import csv
from itertools import groupby

with open(yourfile) as fin:
    tabin = csv.reader(fin, delimiter='\t')
    sentences = [[el[1] for el in g] for k, g in groupby(tabin, bool) if k]
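
If I understand the approach correctly, `csv.reader` yields an empty row for each blank line, so `groupby(tabin, bool)` groups consecutive non-blank rows together and `if k` drops the blank-line groups; `el[1]` keeps the word column. For the sample file above this should give something like:

print(sentences)
# [['word1', 'word2', 'word3'], ['word4', 'word2']]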
Jon Clements