
I have been using this dataset: https://www.kaggle.com/nsharan/h-1b-visa
I have split the main dataframe into two:
soc_null dataframe - rows where the SOC_NAME column is NaN
soc_not_null - rows where the SOC_NAME column has a non-NaN value
To fill the NaN values in the SOC_NAME column of soc_null, I came up with this code:

# Copy SOC_NAME from any non-null row with a matching job title.
for index1, row1 in soc_null.iterrows():
    for index2, row2 in soc_not_null.iterrows():
        if row1['JOB_TITLE'] == row2['JOB_TITLE']:
            soc_null.at[index1, 'SOC_NAME'] = row2['SOC_NAME']

The problem is scale: soc_null has 17,734 rows and soc_not_null has 2,984,724. After running for a couple of hours, only a few hundred values had been updated, so it is not feasible to run this O(n^2) code to completion on a single machine.
I believe there has to be a better way to do this, ideally one that scales to datasets even bigger than mine, since several later steps of the cleaning process would otherwise need the same kind of double loop.

Ronak Thakkar
    `pandas` and `numpy` data structures are not designed for Python-level iteration; you are supposed to apply vectorised `pandas` methods and/or `numpy` functions instead. If you do want to stick with your solution, consider using `numba` or reimplementing it in Cython. – Eli Korvigo Nov 19 '17 at 20:09
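A minimal sketch of the vectorised approach the comment suggests, assuming the goal is to fill each missing SOC_NAME from any non-null row that shares the same JOB_TITLE (column names are from the question; the toy data is invented):

```python
import pandas as pd
import numpy as np

# Invented toy frames standing in for soc_not_null / soc_null.
soc_not_null = pd.DataFrame({
    'JOB_TITLE': ['DATA ENGINEER', 'DATA ENGINEER', 'ANALYST'],
    'SOC_NAME': ['SOFTWARE DEVELOPERS', 'SOFTWARE DEVELOPERS', 'ANALYSTS'],
})
soc_null = pd.DataFrame({
    'JOB_TITLE': ['DATA ENGINEER', 'UNKNOWN TITLE'],
    'SOC_NAME': [np.nan, np.nan],
})

# Build a JOB_TITLE -> SOC_NAME lookup once (first match wins),
# then fill all NaNs in a single vectorised pass.
lookup = soc_not_null.drop_duplicates('JOB_TITLE').set_index('JOB_TITLE')['SOC_NAME']
soc_null['SOC_NAME'] = soc_null['SOC_NAME'].fillna(soc_null['JOB_TITLE'].map(lookup))
```

Titles that never appear in soc_not_null (here 'UNKNOWN TITLE') simply stay NaN; the whole fill is one hash lookup per row instead of a full scan of the 2.9M-row frame.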

1 Answer


There are some nice posts that explain what you need. Here's one solution using combine_first:

import pandas as pd

# Toy frames: df has missing salaries, df_secret holds the values to fill in.
values = [
    {'JOB_TITLE': 'secretary', 'SALARY': 30000},
    {'JOB_TITLE': 'programmer', 'SALARY': 60000},
    {'JOB_TITLE': 'manager', 'SALARY': None},
    {'JOB_TITLE': 'president', 'SALARY': None},
]

secret_values = [
    {'JOB_TITLE': 'manager', 'SALARY': 150000},
    {'JOB_TITLE': 'president', 'SALARY': 1000000},
]

df = pd.DataFrame(values)
df_secret = pd.DataFrame(secret_values)

# Align both frames on JOB_TITLE so rows are matched by title, not position.
df.set_index('JOB_TITLE', inplace=True)
df_secret.set_index('JOB_TITLE', inplace=True)

# combine_first keeps df's values and fills its NaNs from df_secret.
df.combine_first(df_secret).reset_index()

PS: Avoid Python-level for-each loops on large datasets. Prefer vectorised pandas/NumPy operations on whole DataFrames instead.
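Applied to the question's own columns, the same combine_first pattern could look like this (a sketch with invented toy data; real soc_not_null rows may share a JOB_TITLE, so duplicates are dropped before indexing):

```python
import pandas as pd
import numpy as np

# Invented toy data mirroring the question's soc_null / soc_not_null split.
soc_null = pd.DataFrame({
    'JOB_TITLE': ['MANAGER', 'PRESIDENT'],
    'SOC_NAME': [np.nan, np.nan],
})
soc_not_null = pd.DataFrame({
    'JOB_TITLE': ['MANAGER', 'SECRETARY'],
    'SOC_NAME': ['MANAGERS', 'SECRETARIES'],
})

# Index both frames by JOB_TITLE so combine_first aligns rows by title.
filled = (
    soc_null.set_index('JOB_TITLE')
    .combine_first(soc_not_null.drop_duplicates('JOB_TITLE').set_index('JOB_TITLE'))
    .reset_index()
)
```

One caveat: combine_first takes the union of both indexes, so titles that only exist in soc_not_null (here 'SECRETARY') also appear in the result; titles with no match (here 'PRESIDENT') stay NaN.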

Matthias