
I have a for loop that gets slower over time (about 10x slower). The loop iterates over a very large corpus of tweets (7M) to find keywords passed through a dictionary. If a keyword is in the tweet, a DataFrame is updated.

import re
import pandas as pd

# `corpus` is the iterable of tweet texts (7M lines), `words` the keyword
# dictionary, and `data` an (initially empty) DataFrame, all defined earlier.
for n, sent in enumerate(corpus):
    for i, word in words['token'].items():
        tag_1 = words['subtype_I'][i]
        tag_2 = words['subtype_II'][i]
        if re.findall(word, sent):
            df = pd.DataFrame([[sent, tag_1, tag_2, word]],
                              columns=['testo', 'type', 'type_2', 'trigger'])
            data = data.append(df)
            print(n)
        else:
            continue

It starts out processing roughly 1,000 lines per second; after about 900K iterations it slows down to around 100.

What am I missing here? Is it a memory allocation problem? Is there a way to speed this up?

  • [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) BTW, you don't need the `else:`. Are you actually printing `n` 900K times? – martineau Sep 12 '21 at 11:01
  • `df.append` creates a new, larger dataframe with every iteration. A better way is to append each dataframe to a `list` and [`pd.concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) the list of dataframes outside of the loop (a minimal sketch follows these comments). – Michael Szczesny Sep 12 '21 at 11:05
  • @martineau It was just a quick-and-dirty way to see how many lines were being processed. – Leonardo Sanna Sep 12 '21 at 11:08
  • @MichaelSzczesny With your suggestion I managed to complete the loop in roughly 3 hours, thanks! – Leonardo Sanna Sep 12 '21 at 15:24
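
The list-and-`pd.concat` approach from Michael Szczesny's comment could look roughly like this (a minimal sketch, not his exact code; it assumes `corpus` and `words` are defined as in the question):

import re
import pandas as pd

# Collect one small DataFrame per match in a plain Python list ...
frames = []
for sent in corpus:
    for i, word in words['token'].items():
        if re.findall(word, sent):
            frames.append(pd.DataFrame(
                [[sent, words['subtype_I'][i], words['subtype_II'][i], word]],
                columns=['testo', 'type', 'type_2', 'trigger']))

# ... and concatenate once, outside the loop (assumes at least one match).
data = pd.concat(frames, ignore_index=True)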

1 Answer


Maybe I'm misunderstanding the structures you are dealing with, but I'm under the impression that building partial dataframes isn't the optimal way. I'd build up the data in a list comprehension and then create the dataframe in one go:

import re
import pandas as pd

# `corpus` and `words` are assumed to be defined as in the question.
data = [
    [sent, words['subtype_I'][i], words['subtype_II'][i], word]
    for sent in corpus
    for i, word in words['token'].items()
    if re.findall(word, sent)
]
data = pd.DataFrame(data, columns=['testo', 'type', 'type_2', 'trigger'])
Timus
  • `corpus` is a plain txt file, 7M lines, and `words` is a dictionary. I managed to finish in reasonable time with the suggestion in the comments above (appending to a `list` and then using `pd.concat`). Why would you use a list comprehension like you did? Genuine question, because I'm missing the idea behind that choice. – Leonardo Sanna Sep 13 '21 at 15:41
  • @LeonardoSanna Sure: [List comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) are the most natively optimized way of building list-like structures, both in speed and in memory. The version you chose also builds a list - the one passed to `concat` - but through successive appending, which involves repeated resizing (every time the over-allocated space is used up, a new, larger block is allocated, again with over-allocation, and the old list is copied into it). – Timus Sep 13 '21 at 17:10
  • @LeonardoSanna And you build a lot of dataframes, each of which comes with its own overhead (memory and time for initialization). For example, you store the column names over and over again. Overall that means a lot of function calls (`.append()` and `pd.DataFrame()`), which all cost time, and it adds up. And then you still have to concat all those dataframes. – Timus Sep 13 '21 at 17:32
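
A rough way to see the effect described in these comments is to time the row-by-row growth against the build-once approach on small synthetic data. This is a hypothetical sketch: the corpus, keywords, and sizes below are made up, and `pd.concat` stands in for `DataFrame.append`, which was removed in pandas 2.0; absolute numbers will vary by machine.

import re
import timeit
import pandas as pd

# Tiny synthetic stand-ins for the real corpus and keyword dictionary.
corpus = ["the cat sat on the mat", "dogs bark loudly", "birds sing"] * 2000
words = {
    'token':      {0: 'cat', 1: 'dog'},
    'subtype_I':  {0: 'animal', 1: 'animal'},
    'subtype_II': {0: 'feline', 1: 'canine'},
}
columns = ['testo', 'type', 'type_2', 'trigger']

def row_by_row():
    # Grows the DataFrame one row at a time, copying it on every match.
    data = pd.DataFrame(columns=columns)
    for sent in corpus:
        for i, word in words['token'].items():
            if re.findall(word, sent):
                row = pd.DataFrame(
                    [[sent, words['subtype_I'][i], words['subtype_II'][i], word]],
                    columns=columns)
                data = pd.concat([data, row], ignore_index=True)
    return data

def list_then_frame():
    # Collects plain lists first and builds the DataFrame once at the end.
    rows = [
        [sent, words['subtype_I'][i], words['subtype_II'][i], word]
        for sent in corpus
        for i, word in words['token'].items()
        if re.findall(word, sent)
    ]
    return pd.DataFrame(rows, columns=columns)

print("row by row:     ", timeit.timeit(row_by_row, number=1))
print("list then frame:", timeit.timeit(list_then_frame, number=1))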