I have been running a for loop for 4 days now. I understand that there are better ways to do it, but I only needed to run it once. Is there a way to check the progress without interrupting the loop? I'm terrified of losing all of my progress. This is my code:

#scoring the sentences
for i in range(len(df.hope)):
  for word in words:
    df.hope[i] += df.text[i].count(word)
  for word_f in words_f:
    df.fear[i] += df.text[i].count(word_f)
  • My first thought would be to start the debugger and inspect the variables, but it seems you can't start the debugger while a cell is running. If you had started the debugger before running the cell, it would have been possible. – Marcel Aug 13 '22 at 23:24
  • What does your data look like? It might be better to write an implementation with better complexity and run that instead? It could be less than a minute instead of 4+ days. – Marcel Aug 13 '22 at 23:28
  • You don't have to stop that process in order to run a new one with a better implementation. – Marcel Aug 13 '22 at 23:28
  • I have a dataset with 1300000 reddit comments and I want to count how many times words from two lists (1500 and 500 words) recur in any given submission – user18694315 Aug 13 '22 at 23:43

1 Answer

Given the information from your comment, here is the most efficient way I know of to compute it in Python.

import re  # used to escape regex metacharacters in the word lists
df['hope'] = df.text.str.count("|".join(map(re.escape, words)))
df['fear'] = df.text.str.count("|".join(map(re.escape, words_f)))

Hopefully this will terminate in a reasonable amount of time!
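To make the idea concrete, here is a minimal sketch on toy data. The word lists, the sample comments, and the helper name `count_any` are made up for illustration; only the column names `text`, `hope`, and `fear` come from the question. It joins each word list into one regular expression and lets pandas count the matches per row.

```python
import re

import pandas as pd

# Hypothetical stand-ins for the real word lists and comments.
words = ["hope", "wish"]
words_f = ["fear", "dread"]
df = pd.DataFrame(
    {"text": ["i hope and wish", "fear and dread and fear", "nothing here"]}
)
df["text"] = df["text"].astype("string")


def count_any(series, vocabulary):
    """Count, per row, the matches of any word in `vocabulary`."""
    # re.escape keeps characters such as "." or "+" from being read as regex syntax.
    pattern = "|".join(re.escape(w) for w in vocabulary)
    return series.str.count(pattern)


df["hope"] = count_any(df["text"], words)
df["fear"] = count_any(df["text"], words_f)
print(df)
```

One caveat: a regex alternation counts non-overlapping matches, so if one list entry is a substring of another (say "hope" and "hopeful"), the totals can be lower than what the original per-word loop would produce.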

Edit:

I forgot to add that you have to cast the text column to the "string" datatype. Before doing the above, cast the text column using

df.text = df.text.astype("string")
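As a rough illustration of what the cast does, the sketch below uses a made-up three-row frame with one missing comment. The `score` column and the pattern are hypothetical. After the cast, missing values stay missing through `.str.lower()` and `.str.count()`, so they need to be filled before converting to plain integers.

```python
import pandas as pd

# Toy data: the second comment is missing.
df = pd.DataFrame({"text": ["Hope and FEAR", None, "no match"]})

# Cast to the dedicated string dtype, then lowercase as the asker did.
df["text"] = df["text"].astype("string").str.lower()

# Missing text yields a missing count rather than an error.
counts = df["text"].str.count("hope|fear")

# Treat missing comments as zero matches (an assumption, adjust as needed).
df["score"] = counts.fillna(0).astype("int64")
print(df)
```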
Marcel
  • Did you cast the data to string as well? – Marcel Aug 14 '22 at 00:03
  • I already had it in my code, since I was transforming everything to lowercase. It already finished the first list (12 minutes), man this is amazing, thank you very much! – user18694315 Aug 14 '22 at 00:09
  • @user18694315 as this answer demonstrates, iterating over a pandas dataframe one item at a time is almost never a good idea. Pandas (and NumPy, which underlies it) is built on vectorized operations and the concept of [broadcasting](https://stackoverflow.com/questions/29954263/what-does-the-term-broadcasting-mean-in-pandas-documentation). It's (usually) *much* faster than looping. – MattDMo Aug 14 '22 at 00:28