I am looking for an efficient way to load a huge data file.
The file has the following format:
1\tword1\tdata
2\tword2\tdata
3\tword3\tdata
\r\n
1\tword4\tdata
2\tword2\tdata
\r\n
where a line containing only \r\n
marks the end of a sentence, and a sentence consists of the words on the lines above it.
I want to load the file while preserving this structure, i.e. to be able to refer to a sentence and to a word within a sentence. As a result I want to get something like this:
data = [sentence1, sentence2,... ]
where sentence = [word1,word2,...]
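For the sample file above, the result would be something like this (assuming each word keeps its three tab-separated fields; the exact representation of a word is not important):

data = [
    [['1', 'word1', 'data'], ['2', 'word2', 'data'], ['3', 'word3', 'data']],
    [['1', 'word4', 'data'], ['2', 'word2', 'data']],
]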
Loading the file line by line takes a lot of time; loading it in batches is much more efficient. However, I don't know how to parse the batches and split the data into sentences.
Currently I use the following code:
from itertools import islice

def loadf(filename):
    n = 100000
    data = []
    with open(filename) as f:
        while True:
            # read the next batch of up to n lines
            next_n_lines = list(islice(f, n))
            if not next_n_lines:
                break
            data.extend([line.strip().split('\t') for line in next_n_lines])
    return data
With this code I don't know how to split the data into sentences. In addition, I suspect that extend
does not actually extend the current list but creates a new one and reassigns it, because it's extremely slow.
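To make the goal concrete, a naive line-by-line sketch that does produce the sentence structure I want would look roughly like this (the loadf_slow name is just for illustration); it is exactly the slow approach I am trying to avoid:

def loadf_slow(filename):
    data = []
    sentence = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                # an empty line (just \r\n) ends the current sentence
                if sentence:
                    data.append(sentence)
                    sentence = []
            else:
                sentence.append(line.split('\t'))
    # handle a file that doesn't end with an empty line
    if sentence:
        data.append(sentence)
    return data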
I would appreciate any help.