I have a for loop that gets slower over time (roughly 10x slower). The loop iterates over a very large corpus of tweets (7M) to find keywords passed in through a dictionary. If a keyword is found in a tweet, a row is appended to a DataFrame.
import re
import pandas as pd

for n, sent in enumerate(corpus):
    for i, word in words['token'].items():
        tag_1 = words['subtype_I'][i]
        tag_2 = words['subtype_II'][i]
        if re.findall(word, sent):
            # keyword found: build a one-row frame and append it to the results
            df = pd.DataFrame([[sent, tag_1, tag_2, word]],
                              columns=['testo', 'type', 'type_2', 'trigger'])
            data = data.append(df)
            print(n)
        else:
            continue
It starts out processing roughly 1,000 lines per second, but after about 900K iterations it slows down to around 100.
What am I missing here? Is it a memory allocation problem? Is there a way to speed this up?
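Would collecting the matches in a plain list and building the DataFrame once at the end be the right direction? A rough sketch of what I have in mind (assuming words and corpus have the same structure as above, so this is untested against my real data):

import re
import pandas as pd

rows = []
# compile each pattern once instead of re-scanning the words table for every tweet
patterns = [(re.compile(word), words['subtype_I'][i], words['subtype_II'][i], word)
            for i, word in words['token'].items()]

for n, sent in enumerate(corpus):
    for pattern, tag_1, tag_2, word in patterns:
        if pattern.search(sent):
            # just remember the matching row; no DataFrame work inside the loop
            rows.append([sent, tag_1, tag_2, word])

# build the DataFrame in one shot at the end
data = pd.DataFrame(rows, columns=['testo', 'type', 'type_2', 'trigger'])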