
I have been using this dataset: https://www.kaggle.com/nsharan/h-1b-visa
I have split the main dataframe into two:
soc_null dataframe - rows where the SOC_NAME column is NaN
soc_not_null - rows where the SOC_NAME column has a non-NaN value
To fill the NaN values in the SOC_NAME column of soc_null, I came up with this code:

# Copy SOC_NAME from any non-null row with a matching job title.
for index1, row1 in soc_null.iterrows():
    for index2, row2 in soc_not_null.iterrows():
        if row1['JOB_TITLE'] == row2['JOB_TITLE']:
            soc_null.at[index1, 'SOC_NAME'] = row2['SOC_NAME']

The problem is scale: soc_null has 17,734 rows and soc_not_null has 2,984,724. After running for a couple of hours, only a few hundred values had been updated, so it is not feasible to run this O(n^2) code to completion on a single machine.
I believe there has to be a better way to do this, ideally one that scales to datasets even bigger than mine, since several later steps of the cleaning process would otherwise need the same kind of double loop.

Ronak Thakkar
    `pandas` and `numpy` data structures are not designed for Python-level iteration; you are supposed to apply vectorised `pandas` methods and/or `numpy` functions instead. If you do want to stick with your solution, consider using `numba` or reimplementing it in Cython. – Eli Korvigo Nov 19 '17 at 20:09
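A minimal sketch of the vectorised approach the comment suggests, assuming the goal is to fill each missing SOC_NAME from any non-null row that shares the same JOB_TITLE (column names are from the question; the toy data is invented):

```python
import pandas as pd
import numpy as np

# Invented toy frames standing in for soc_not_null / soc_null.
soc_not_null = pd.DataFrame({
    'JOB_TITLE': ['DATA ENGINEER', 'DATA ENGINEER', 'ANALYST'],
    'SOC_NAME': ['SOFTWARE DEVELOPERS', 'SOFTWARE DEVELOPERS', 'ANALYSTS'],
})
soc_null = pd.DataFrame({
    'JOB_TITLE': ['DATA ENGINEER', 'UNKNOWN TITLE'],
    'SOC_NAME': [np.nan, np.nan],
})

# Build a JOB_TITLE -> SOC_NAME lookup once (first match wins),
# then fill all NaNs in a single vectorised pass.
lookup = soc_not_null.drop_duplicates('JOB_TITLE').set_index('JOB_TITLE')['SOC_NAME']
soc_null['SOC_NAME'] = soc_null['SOC_NAME'].fillna(soc_null['JOB_TITLE'].map(lookup))
```

Titles that never appear in soc_not_null (here 'UNKNOWN TITLE') simply stay NaN; the whole fill is one hash lookup per row instead of a full scan of the 2.9M-row frame.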

1 Answer


There are some nice posts that explain what you need. Here's one solution using combine_first:

import pandas as pd

# Toy frames: df has missing salaries, df_secret holds the values to fill in.
values = [
    {'JOB_TITLE': 'secretary', 'SALARY': 30000},
    {'JOB_TITLE': 'programmer', 'SALARY': 60000},
    {'JOB_TITLE': 'manager', 'SALARY': None},
    {'JOB_TITLE': 'president', 'SALARY': None},
]

secret_values = [
    {'JOB_TITLE': 'manager', 'SALARY': 150000},
    {'JOB_TITLE': 'president', 'SALARY': 1000000},
]

df = pd.DataFrame(values)
df_secret = pd.DataFrame(secret_values)

# Align both frames on JOB_TITLE so rows are matched by title, not position.
df.set_index('JOB_TITLE', inplace=True)
df_secret.set_index('JOB_TITLE', inplace=True)

# combine_first keeps df's values and fills its NaNs from df_secret.
df.combine_first(df_secret).reset_index()

PS: Avoid Python-level for-each loops on large datasets. Prefer vectorised pandas/NumPy operations on whole DataFrames instead.
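Applied to the question's own columns, the same combine_first pattern could look like this (a sketch with invented toy data; real soc_not_null rows may share a JOB_TITLE, so duplicates are dropped before indexing):

```python
import pandas as pd
import numpy as np

# Invented toy data mirroring the question's soc_null / soc_not_null split.
soc_null = pd.DataFrame({
    'JOB_TITLE': ['MANAGER', 'PRESIDENT'],
    'SOC_NAME': [np.nan, np.nan],
})
soc_not_null = pd.DataFrame({
    'JOB_TITLE': ['MANAGER', 'SECRETARY'],
    'SOC_NAME': ['MANAGERS', 'SECRETARIES'],
})

# Index both frames by JOB_TITLE so combine_first aligns rows by title.
filled = (
    soc_null.set_index('JOB_TITLE')
    .combine_first(soc_not_null.drop_duplicates('JOB_TITLE').set_index('JOB_TITLE'))
    .reset_index()
)
```

One caveat: combine_first takes the union of both indexes, so titles that only exist in soc_not_null (here 'SECRETARY') also appear in the result; titles with no match (here 'PRESIDENT') stay NaN.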

Matthias