def frequency_with_batch(textlines, freq_counter):
    docs = nlps(textlines)  # nlps() tokenizes the whole batch of lines
    log_every_n = 100000
    for i, doc in enumerate(docs):
        # (log print statement omitted; log_every_n controls how often it prints)
        tokens = doc['words']  # list of words produced by nlps()
        freq_counter.update(tokens)

I have a 40 GB text file whose word frequencies I want to count. The code reads the file in batches of 1,000 lines. The counter is:

freq_counter = collections.Counter()
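
For reference, the reading loop is roughly the following. This is a sketch rather than my exact code: the file name is made up, and itertools.islice stands in for however the 1,000-line batches are actually produced.

import collections
import itertools

freq_counter = collections.Counter()

with open("corpus.txt", encoding="utf-8") as f:  # hypothetical file name
    while True:
        textlines = list(itertools.islice(f, 1000))  # one 1,000-line batch
        if not textlines:
            break
        frequency_with_batch(textlines, freq_counter)

Reading with islice keeps only one batch of lines in memory at a time, so the 40 GB file is never loaded whole.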

I haven't profiled it precisely, but watching the log while it runs, processing seems to be getting slower. It has now finished 30 million lines. I don't see what other factor could be slowing it down. The machine has 300 GB of memory, which is more than enough.

Will the Counter naturally become slower when used this way?
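
One way to check the Counter in isolation is a synthetic benchmark like the sketch below. Everything in it is invented for the test (the vocabulary, the batch size, the number of rounds); it only measures whether Counter.update slows down as the counter accumulates entries.

import collections
import random
import time

freq_counter = collections.Counter()
vocab = [f"word{n}" for n in range(200_000)]  # made-up vocabulary

for round_no in range(10):
    batch = [random.choice(vocab) for _ in range(1_000_000)]
    start = time.perf_counter()
    freq_counter.update(batch)
    elapsed = time.perf_counter() - start
    print(f"round {round_no}: {elapsed:.3f}s, {len(freq_counter)} distinct words")

If the per-round times stay flat while the distinct-word count grows, the Counter itself is not what is getting slower.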

ADDITIONS:

  1. 'tokens' is a list of words, split by the nlps() function.
  2. I omitted the log print statement.
  • What is `log_every_n` for? – Klaus D. Aug 14 '20 at 20:04
  • I wouldn't ordinarily expect it to get slower, no. (1) Have you profiled to see which lines are slow? (2) What does `nlps` do? (3) Have you plotted runtime (or the square root thereof) vs how far you are in the file? It's pretty common for the OS, a library, or something on top to accidentally have quadratic performance or to redo a bunch of work as you get deeper into a file/directory. – Hans Musgrave Aug 14 '20 at 20:18
  • @KlausD. OP can answer when they get to it, but in the meantime I'd guess that we're seeing a snippet and later on they have something like `if not i % log_every_n: log()` – Hans Musgrave Aug 14 '20 at 20:20
  • 1
    Everything becomes slower and slower when the input becomes bigger. The question is how big and how slow. BTW, your code is not complete, so it's hard to say what you are doing. What is `tokens` in `freq_counter.update(tokens)`? You should make a [mcve] – zvone Aug 14 '20 at 20:22
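
Following the profiling suggestion in the comments, a minimal check with the standard-library cProfile could look like this sketch; sample_batch is a hypothetical stand-in for one real 1,000-line batch.

import cProfile
import pstats
import collections

freq_counter = collections.Counter()
sample_batch = ["a line of text to tokenize"] * 1000  # hypothetical batch

# Profile one call, dump the stats to a file, then print the top offenders
cProfile.run("frequency_with_batch(sample_batch, freq_counter)", "batch.prof")
pstats.Stats("batch.prof").sort_stats("cumulative").print_stats(10)

This would show whether the time is going into nlps(), the Counter update, or somewhere else entirely.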

0 Answers