I have a dataset with a column 'Self_Employed'. This column contains the values 'Yes', 'No' and NaN. I want to replace the NaN values with a value calculated in calc(). I've tried some methods I found on here, but couldn't find one that applied to my case. Here is my code; the things I've tried are in comments:

from random import randint

# Handling missing data - Self_Employed
SEyes = (df['Self_Employed'] == 'Yes').sum()
SEno = (df['Self_Employed'] == 'No').sum()

def calc():
    # random.randint includes both endpoints
    rand_SE = randint(0, SEno + SEyes)
    if rand_SE > 81:
        return 'No'
    else:
        return 'Yes'


# df['Self_Employed'] = df['Self_Employed'].fillna(randint(0,100))
# df['Self_Employed'].isnull().apply(lambda v: calc())


# df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
# df[df['Self_Employed']]

# df_nan['Self_Employed'] = df_nan['Self_Employed'].isnull().apply(lambda v: calc())
# df_nan

# for i in range(df['Self_Employed'].isnull().sum()):
#     print(df.Self_Employed[i]


df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
df

The line where I tried it with df_nan seems to work, but that gives me a separate set containing only the formerly missing values, whereas I want to fill the missing values in the whole dataset. The last line of code gives me an error; I linked a screenshot of it. Do you understand my problem and, if so, can you help?

This is the set with only the rows where Self_Employed is NaN

This is the original dataset

This is the error

3 Answers

Make sure that SEno + SEyes is not zero, then use the .loc method to set Self_Employed where it is missing:

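import numpy as np

# The + 1 keeps the upper bound passed to np.random.randint above zero
SEyes = (df['Self_Employed'] == 'Yes').sum() + 1
SEno = (df['Self_Employed'] == 'No').sum()

def calc():
    # np.random.randint excludes the upper bound
    rand_SE = np.random.randint(0, SEno + SEyes)
    if rand_SE >= 81:  # 81 is the cutoff used in the question
        return 'No'
    else:
        return 'Yes'

# Call calc() once per missing row and write the results back into the same rows
mask = df['Self_Employed'].isna()
df.loc[mask, 'Self_Employed'] = df.loc[mask, 'Self_Employed'].apply(lambda x: calc())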
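
If the intent is to fill the gaps in proportion to how often 'Yes' and 'No' already occur, the per-row apply can be swapped for a single vectorised draw. This is only a sketch under that assumption: the sample frame, the Income column, and the seed are made up for illustration, and it replaces the hard-coded cutoff with the observed frequencies rather than reproducing calc() exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real df comes from the asker's dataset.
df = pd.DataFrame({
    "Self_Employed": ["Yes", "No", np.nan, "No", np.nan, "No"],
    "Income": [10, 20, 30, 40, 50, 60],
})

mask = df["Self_Employed"].isna()

# Share of each category among the non-missing values (NaN is dropped by default).
probs = df["Self_Employed"].value_counts(normalize=True)

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

# Draw one label per missing row, weighted by the observed shares,
# and assign only into the Self_Employed column of those rows.
df.loc[mask, "Self_Employed"] = rng.choice(
    probs.index.to_numpy(), size=int(mask.sum()), p=probs.to_numpy()
)
```

Selecting the rows and the column in the same .loc call is what keeps the other columns untouched.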
Charles R

What about df['Self_Employed'] = df['Self_Employed'].fillna(calc())?

Josh Friedlander
  • This just runs calc() once and uses that result for every row, instead of doing the calculation per row. I want the NaNs to be filled with a semi-random mix of 'Yes' and 'No'. – Manolo Viso Romero Nov 08 '18 at 14:38
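
The difference described in this comment can be seen in a small sketch. The Series and the 50/50 stand-in for calc() below are assumptions made for illustration, not the asker's data or weighting: passing a scalar to fillna evaluates calc() once, while passing a Series indexed by the missing rows gives one draw per row.

```python
import random
import pandas as pd

# Hypothetical column with three gaps.
s = pd.Series(["Yes", None, "No", None, None], name="Self_Employed")

def calc():
    # Stand-in for the asker's calc(); the even split is an assumption.
    return random.choice(["Yes", "No"])

mask = s.isna()

# Scalar argument: calc() is evaluated once, before fillna runs,
# so every gap receives that one value.
same = s.fillna(calc())

# Series argument: one calc() call per missing row; fillna aligns on the index.
draws = pd.Series([calc() for _ in range(int(mask.sum()))], index=s.index[mask])
per_row = s.fillna(draws)
```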
You could first identify the locations of your NaNs like

na_loc = df.index[df['Self_Employed'].isnull()]

Count the number of NaNs in your column like

num_nas = len(na_loc)

Then generate that many random numbers, already indexed by the NaN locations

import random
import pandas as pd

fill_values = pd.DataFrame({'Self_Employed': [random.randint(0, 100) for _ in range(num_nas)]}, index=na_loc)

And finally substitute those values in your dataframe

df.loc[na_loc, 'Self_Employed'] = fill_values['Self_Employed']
Lukas Thaler
  • This did fill the NaNs I intended to fill in my df, but it also replaced every other value in those rows with NaN. Row 11, for example, is now: NaN NaN NaN NaN NaN No NaN NaN NaN NaN NaN. – Manolo Viso Romero Nov 08 '18 at 14:48
  • That is because I forgot to select the `Self_Employed` column in the assign statement. It is corrected now – Lukas Thaler Nov 08 '18 at 15:02
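
A minimal sketch of the column-scoped assignment discussed in these comments, assuming a made-up two-column frame and an even 'Yes'/'No' split (neither is from the original post). Naming both the rows and the column inside one .loc call restricts the write to Self_Employed, so the remaining columns in those rows keep their values.

```python
import random
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the asker's dataset.
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "Self_Employed": ["No", np.nan, "Yes", np.nan],
})

na_loc = df.index[df["Self_Employed"].isnull()]

# One random label per missing row, indexed by the rows it belongs to.
# The even split is an assumption, not the asker's weighting.
fill_values = pd.Series(
    [random.choice(["Yes", "No"]) for _ in range(len(na_loc))],
    index=na_loc,
)

# Row labels and column label in a single .loc call: only this column changes.
df.loc[na_loc, "Self_Employed"] = fill_values
```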