How can I speed up my Python for loop that goes through rows of lists?

Question

This is my current code:

df['company_id'] = ''
length = 0
while length < len(df):
  for x in df:
    if df['associations.companies.results'][length] == 'nan':
      df.loc[df['associations.companies.results'] == 'nan', 'company_id'] = 0  
    else:
      df['company_id'][length] = df['associations.companies.results'][length][0]['id']
  length = length +1

I tried lambda and np.where versions of this code, but they gave errors that I couldn't solve. The data set has close to 40 rows, and I am trying to get the company ID out of a dict nested in a list. Each row looks like this:

[{'id': 'XXXXXXXXXX', 'type': 'call_to_company'}]

Sometimes there is no company_id, and the row looks like this:

nan

The final result would be a separate column called "company_id" that contains the 'id' value.

Right now the code has been running for 30 minutes and is still going.

I hope someone can help. Thanks!

Hi, could you please include an example of your df? – bitflip Oct 18 '22 at 09:58

score 3 · Accepted Answer · answered Oct 18 '22 at 10:12

There are various improvements that you could make, but I'm still not entirely sure what kind of output you are expecting.

First of all, you call len() on every iteration, because you put it in the condition of the while loop. That is unnecessary, since it only needs to run once.

Second: you have a nested loop (I think because you wanted to iterate over both the indexes and the elements). This is a big problem, because the loop body is repeated on every pass of the inner loop for every row, which pushes the cost towards O(n^2) instead of O(n). You could have used enumerate() on the column, or simply used only the indexes:

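df['company_id'] = ''

for i in range(len(df)):
    # Missing associations are stored as the string 'nan'
    if df['associations.companies.results'][i] == 'nan':
        # Set only this row, instead of rescanning the whole column each time
        df.at[i, 'company_id'] = 0
    else:
        # Take the 'id' of the first dict in the list
        df.at[i, 'company_id'] = df['associations.companies.results'][i][0]['id']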

I'm sure this could be further improved with a list comprehension or DataFrame .apply(), but I still don't understand your goal, so this is the most I can do.
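As a rough sketch of the .apply() idea mentioned above: the sample data and the helper name extract_company_id are made up for illustration, and the code assumes each cell is either a non-empty list of dicts or a placeholder such as the string 'nan'.

```python
import pandas as pd

# Hypothetical sample data; missing rows hold the string 'nan',
# as the question's code assumes.
df = pd.DataFrame({
    'associations.companies.results': [
        [{'id': '111', 'type': 'call_to_company'}],
        'nan',
        [{'id': '222', 'type': 'call_to_company'}],
    ]
})


def extract_company_id(cell):
    """Return the first dict's 'id', or 0 when the cell is not a non-empty list."""
    if isinstance(cell, list) and cell:
        return cell[0]['id']
    return 0


# One pass over the column, with no explicit index bookkeeping.
df['company_id'] = df['associations.companies.results'].apply(extract_company_id)

# The equivalent list comprehension gives the same values.
same_ids = [extract_company_id(c) for c in df['associations.companies.results']]

print(df['company_id'].tolist())  # ['111', 0, '222']
```

Checking the cell type instead of comparing against 'nan' also covers rows that hold a real float NaN, which may or may not be what your data contains.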

If you've never heard of Big-O notation before, I recommend reading this

You absolute hero! Did the job in 2 mins. I am going to check that article now! Thanks for being so helpful — Opper_Draak, Oct 18 '22 at 10:21

score 2 · Answer 2 · answered Oct 18 '22 at 10:00

I hope I understood your use case, so here is my idea:

Try using a for loop with enumerate()! With this, you can avoid having a counter variable entirely.

Like so:

df['company_id'] = ''
for i, x in enumerate(df):
    if df['associations.companies.results'][i] == 'nan':
        df.loc[df['associations.companies.results'] == 'nan', 'company_id'] = 0  

    else:
        df['company_id'][i] = df['associations.companies.results'][i][0]['id']

Sadly, your code is not really reproducible, so I hope I understood it correctly.
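One thing to watch for with enumerate(): iterating over a DataFrame directly yields its column labels, not its rows, so the loop runs once per column. A small sketch of the difference, using invented sample data (the 'other' column is hypothetical):

```python
import pandas as pd

# Hypothetical sample data; 'other' is an invented second column.
df = pd.DataFrame({
    'associations.companies.results': [
        [{'id': '111', 'type': 'call_to_company'}],
        'nan',
        [{'id': '222', 'type': 'call_to_company'}],
    ],
    'other': ['a', 'b', 'c'],
})

# Enumerating the DataFrame itself yields (position, column label) pairs.
labels = list(enumerate(df))
print(labels)  # [(0, 'associations.companies.results'), (1, 'other')]

# Enumerating the column yields (position, cell) pairs, one per row.
company_ids = []
for i, cell in enumerate(df['associations.companies.results']):
    if isinstance(cell, list) and cell:
        company_ids.append(cell[0]['id'])
    else:
        # Placeholder for rows without a company, as in the question
        company_ids.append(0)

df['company_id'] = company_ids
print(company_ids)  # ['111', 0, '222']
```

Collecting the values in a list and assigning the column once also avoids writing to the DataFrame cell by cell.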

Marco Frag Delle Monache


Dude, thank you so much for your fast response! The code runs, but it only fills in the if branch. The company_id column stays empty. What can I do to make the code more reproducible? – Opper_Draak Oct 18 '22 at 10:18
Please add an example of your df variable. I was forced to use your example! Try using my code, but replace my df declaration with yours. – Marco Frag Delle Monache Oct 18 '22 at 10:55
