This is my current code:

df['company_id'] = ''
length = 0
while length < len(df):
  for x in df:
    if df['associations.companies.results'][length] == 'nan':
      df.loc[df['associations.companies.results'] == 'nan', 'company_id'] = 0  
    else:
      df['company_id'][length] = df['associations.companies.results'][length][0]['id']
  length = length +1

I tried versions of this code with a lambda and with np.where, but these gave errors that I couldn't solve. The data set has close to 40 rows, and I am trying to get the company ID out of a dict nested in a list. Each row looks like this:

[{'id': 'XXXXXXXXXX', 'type': 'call_to_company'}]

Sometimes there is no company_id, and the row looks like this:

nan

The final result should be a separate column called "company_id" that contains the 'id' value.

Right now the code has been running for 30 minutes and is still going.

I hope someone can help. Thanks!

2 Answers

There are various improvements that you could make, but I'm still not entirely sure what kind of output you are expecting.

First of all, you call len() at every iteration because you put it in the header of the while loop. This is unnecessary: the length does not change, so you only need to compute it once.

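Second: you have a nested loop (I think because you wanted to iterate over both the indexes and the elements), but this is a big mistake. `for x in df` iterates over the column names, so the body is repeated once per column for every row without ever using x. On top of that, the masked df.loc assignment scans the whole column each time it runs, which pushes the cost toward O(n^2) instead of O(n). You could have used enumerate() on the column, or simply iterated over the indexes: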

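df['company_id'] = ''
col = 'associations.companies.results'

# Assumes the default RangeIndex, so i is both position and label
for i in range(len(df)):
    value = df.at[i, col]
    if isinstance(value, list) and value:
        # Take the 'id' from the first dict in the list
        df.at[i, 'company_id'] = value[0]['id']
    else:
        # Missing value (NaN or the string 'nan'): write 0 to this row only
        df.at[i, 'company_id'] = 0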

I'm sure this could be further improved with a list comprehension or DataFrame .apply(), but I still don't understand your goal, so this is the most I can do.
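For reference, here is a minimal sketch of what the list comprehension and .apply() variants might look like. The sample data is made up to match the shape described in the question, and first_company_id is a hypothetical helper name, not something from the original code. It assumes a missing value should become 0, as in the loop version.

```python
import numpy as np
import pandas as pd

# Made-up sample data in the shape described in the question
df = pd.DataFrame({
    "associations.companies.results": [
        [{"id": "111", "type": "call_to_company"}],
        np.nan,
        [{"id": "222", "type": "call_to_company"}],
    ]
})

col = "associations.companies.results"


def first_company_id(value, default=0):
    """Return the 'id' of the first dict in a list, or default if missing."""
    if isinstance(value, list) and value:
        return value[0]["id"]
    # Covers a real NaN, the string 'nan', and empty lists
    return default


# One pass over the column, with no per-row DataFrame indexing
df["company_id"] = [first_company_id(v) for v in df[col]]

# The same result using Series.apply
via_apply = df[col].apply(first_company_id)

print(df["company_id"].tolist())
```

Both variants visit each row exactly once, so the running time grows linearly with the number of rows.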

If you've never heard of Big-O notation before, I recommend you read this

Mamiglia

I hope I understood your use case. Here is my idea:

Try using a for loop with enumerate()! That way you can avoid maintaining a counter variable yourself.

Like so:

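df['company_id'] = ''
col = 'associations.companies.results'
# Enumerate the column, not df itself: iterating a DataFrame yields column labels
# Assumes the default RangeIndex, so i is also the row label
for i, x in enumerate(df[col]):
    if isinstance(x, list) and x:
        # The first dict in the list holds the company 'id'
        df.at[i, 'company_id'] = x[0]['id']
    else:
        # Missing value (NaN or the string 'nan'): write 0 to this row only
        df.at[i, 'company_id'] = 0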

Sadly, your code is not really reproducible, so I hope I understood it correctly.
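One thing to watch for with enumerate(): iterating over a DataFrame directly yields its column labels, not its rows, so a loop over enumerate(df) runs once per column and only touches the first few row positions. The sketch below illustrates the difference on a made-up frame; the 'owner' column is a hypothetical extra column added only for the demonstration.

```python
import numpy as np
import pandas as pd

# Made-up two-column frame; 'owner' is a hypothetical extra column
df = pd.DataFrame({
    "owner": ["a", "b", "c"],
    "associations.companies.results": [
        [{"id": "111", "type": "call_to_company"}],
        np.nan,
        [{"id": "222", "type": "call_to_company"}],
    ],
})

# Iterating the DataFrame itself yields column labels, one per column
over_frame = list(enumerate(df))

# Iterating a single column yields one (position, value) pair per row
over_column = list(enumerate(df["associations.companies.results"]))

print(over_frame)        # [(0, 'owner'), (1, 'associations.companies.results')]
print(len(over_column))  # 3
```

So to visit every row, enumerate the column you care about rather than the frame.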

  • Dude, thank you so much for your fast response! The code runs, but it only fills in the if branch. The company_id column stays empty. What can I do to make the code more reproducible? – Opper_Draak Oct 18 '22 at 10:18
  • Please add an example of your df variable. I was forced to use your example! Try using my code but replacing my df declaration with yours. – Marco Frag Delle Monache Oct 18 '22 at 10:55