I have been using this dataset: https://www.kaggle.com/nsharan/h-1b-visa
I have split the main dataframe into two:
soc_null - the rows where the SOC_NAME column is NaN
soc_not_null - the rows where the SOC_NAME column has a value other than NaN
To fill the NaN values in the SOC_NAME column of soc_null, I came up with this code:
for index1, row1 in soc_null.iterrows():
    for index2, row2 in soc_not_null.iterrows():
        if row1['JOB_TITLE'] == row2['JOB_TITLE']:
            # DataFrame.set_value was removed in pandas 1.0; .at is the scalar setter
            soc_null.at[index1, 'SOC_NAME'] = row2['SOC_NAME']
The problem is that soc_null has 17734 rows and soc_not_null has 2984724 rows, so the nested loop performs up to 17734 × 2984724 (about 5.3 × 10^10) row comparisons. I ran it for a couple of hours and only a few hundred values were updated, so it is not feasible to run this O(n·m) code to completion on a single machine.
I believe there has to be a better way to do this, ideally one that also scales to datasets bigger than mine, since several later steps of the cleaning process would otherwise need the same kind of two-loop processing.
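
For reference, here is a minimal sketch of one possible alternative to the nested loop: build a JOB_TITLE to SOC_NAME lookup once and apply it with Series.map. It has not been run on the full dataset. The small frames below are made-up stand-ins for soc_null and soc_not_null, and `lookup` is a name introduced only for this illustration. It assumes that when a title appears with several SOC_NAME values, taking the last one is acceptable (which is what the loop effectively does, since later matches overwrite earlier ones).

```python
import pandas as pd

# Toy stand-ins for the two frames (hypothetical data, not taken from the Kaggle file).
soc_not_null = pd.DataFrame({
    "JOB_TITLE": ["ANALYST", "ENGINEER", "ANALYST"],
    "SOC_NAME": ["ANALYSTS", "ENGINEERS", "MANAGEMENT ANALYSTS"],
})
soc_null = pd.DataFrame(
    {"JOB_TITLE": ["ENGINEER", "ANALYST", "CHEF"],
     "SOC_NAME": [float("nan")] * 3},
    index=[10, 11, 12],
)

# Build the lookup once: one SOC_NAME per JOB_TITLE.
# Rows with a missing JOB_TITLE are dropped, because NaN never equals NaN in the loop.
# keep="last" mirrors the loop, where a later match overwrites an earlier one.
lookup = (
    soc_not_null.dropna(subset=["JOB_TITLE"])
    .drop_duplicates("JOB_TITLE", keep="last")
    .set_index("JOB_TITLE")["SOC_NAME"]
)

# Map every title in soc_null through the lookup; titles without a match stay NaN.
soc_null["SOC_NAME"] = soc_null["JOB_TITLE"].map(lookup)

print(soc_null)
```

Because the lookup is built in a single pass over soc_not_null and each title in soc_null is then resolved by an index lookup, the work grows roughly with the combined number of rows rather than with their product.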