I have a huge text file for which I want to build a word-frequency dictionary (a Counter). Currently, I am doing it with the following code:
from collections import Counter

vocab = Counter()
with open(file_name) as input_doc:
    for line in input_doc:
        for word in line.strip().split():
            vocab[word] += 1
However, since the file is huge, this takes a long time, so I am looking for a faster way to do it.
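For reference, I believe the loop above is equivalent to building the Counter directly from a generator, and I assume the bottleneck is the same (this is just a restatement of the current approach, not a fix):

from collections import Counter

# Same counting logic as above, expressed as a single Counter() call;
# I assume the time is still dominated by reading and splitting every line.
with open(file_name) as input_doc:
    vocab = Counter(word for line in input_doc for word in line.strip().split())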
The most straightforward solution that comes to mind is to store lines in a list in small batches, process each batch separately (in parallel with the other batches), and merge the results at the end. That way, the main thread can read the next batch of lines from the file while the previously read batches are being processed in parallel, which should save a lot of time.
Something like this:
buffer_size = 1000
buff = []
vocab = Counter()
with open(file_name) as input_doc:
    for line in input_doc:
        buff.append(line)
        if len(buff) == buffer_size:
            # Here I should create and call a new thread to work on the new batch
            vocab += update_dictionary(buff)
            buff = []
    if buff:
        # Process the last, partially filled batch
        vocab += update_dictionary(buff)
Here, the update_dictionary() method reads all the sentences in the given list and counts them into its own local dictionary. Once it is done, that local dictionary should be merged into the global one. I tried for a couple of hours, but unfortunately, since I have never written multi-threaded code in Python, I couldn't manage to make it work. Could you please help me implement this idea?
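For clarity, this is roughly what I have in mind for update_dictionary(); it is just a single-threaded sketch of the per-batch work, and the parallel part is exactly what I am missing:

def update_dictionary(lines):
    # Count the words of one batch into a local Counter;
    # the caller merges it into the global vocab with "vocab +=".
    local_vocab = Counter()
    for line in lines:
        for word in line.strip().split():
            local_vocab[word] += 1
    return local_vocab

So the merging itself is just vocab += update_dictionary(buff), as in the loop above; what I cannot figure out is how to run these calls in separate threads (or processes) while the main thread keeps reading the file.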
Thank you very much.