def frequency_with_batch(textlines, freq_counter):
    docs = nlps(textlines)
    log_every_n = 100000
    for i, doc in enumerate(docs):
        tokens = doc['words']            # list of words, split by the nlps() function
        freq_counter.update(tokens)
        if (i + 1) % log_every_n == 0:
            pass                         # progress log print omitted here
I have a 40 GB text file and I want to count word frequencies in it. The file is read in batches of 1,000 lines at a time. The counter is:
freq_counter = collections.Counter()
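For context, the driver loop looks roughly like this (a minimal sketch; only the 1,000-line batching is as described above, the file path and the run() name are placeholders):

    import collections
    import itertools

    def run(path, batch_size=1000):
        freq_counter = collections.Counter()
        with open(path, encoding='utf-8') as f:
            while True:
                # read the next batch of up to batch_size lines
                batch = list(itertools.islice(f, batch_size))
                if not batch:
                    break
                frequency_with_batch(batch, freq_counter)
        return freq_counter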
I haven't profiled it precisely, but judging from the log output it seems to be getting slower as it runs. It has now processed 30 million lines. I don't see what other factors could be slowing it down; the machine has 300 GB of memory, which is more than enough.
Will the Counter naturally become slower when used this way?
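If it's useful, this is the kind of standalone timing check I could run to see whether update() itself slows down as the counter grows (synthetic data and names, not my real pipeline):

    import collections
    import random
    import time

    counter = collections.Counter()
    vocab = [f"w{i}" for i in range(200000)]   # synthetic vocabulary
    for step in range(1, 11):
        batch = [random.choice(vocab) for _ in range(1000000)]
        t0 = time.perf_counter()
        counter.update(batch)                  # same call pattern as in frequency_with_batch
        print(f"batch {step}: {time.perf_counter() - t0:.2f}s, distinct words {len(counter)}")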
ADDITIONS:
- 'tokens' is a list of words, split by the nlps() function.
- I omitted the log print statement from the code above.