
Wondering if anyone can help me with this problem. I am working on a machine learning problem and have binned the df1['Age'] column into df1['Age_group']. Unfortunately some ages are missing, so any row where df1['Age'] is NaN is assigned to group 3.

Currently, group 3 only means "missing data", and I want to replace it with something useful. I used scikit-learn logistic regression to predict the missing age groups, which are now stored in a NumPy array called missing_age_grps.

Obviously the data set I am working with is much bigger, but below should be enough data to illustrate the problem.

In the example below, missing_age_grps has only 2 elements because there are only 2 rows where df1['Age_group'] == 3.

import pandas as pd
import numpy as np

d = {'ID': [0, 1, 2, 3, 4], 'Sex': ["Male","Female","Male","Male", "Female"], 'Age':[np.nan, 23, np.nan, 6, 15] , 'Age_group':[3,2,3,0,1]}
df1 = pd.DataFrame(d)

print(df1)

ID   Sex         Age  Age_group
0    Male        NaN      3   
1    Female      23       2   
2    Male        NaN      3
3    Male        6        0
4    Female      15       1 /....

print(missing_age_grps)

[0, 1]

I am having trouble overwriting only the values in df1['Age_group'] that equal 3.

The ideal solution will update only the 3s with the values from the NumPy array, in order. This is the expected output:

print(df1)

ID   Sex         Age  Age_group
0    Male        NaN      0   
1    Female      23       2   
2    Male        NaN      1
3    Male        6        0
4    Female      15       1 /....
hamslice

1 Answer


Since the question does not show the actual NumPy array, I will make up a placeholder array and use it as the replacement.

import pandas as pd
import numpy as np

d = {'ID': [0, 1, 2, 3, 4], 'Sex': ["Male","Female","Male","Male", "Female"], 'Age':[np.nan, 23, np.nan, 6, 15] , 'Age_group':[3,2,3,0,1]}
df1 = pd.DataFrame(d)
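# Placeholder values; substitute your own array (e.g. missing_age_grps)
replacement_array = np.array([22, 23])
# Boolean mask on the rows, column label on the columns: only the 3s are overwritten
df1.loc[df1['Age_group'] == 3, 'Age_group'] = replacement_array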
print(df1)

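The idea is to select only the rows where df1['Age_group'] == 3 and assign the replacement array to that subset. The array must have exactly as many elements as there are matching rows, and its values are assigned in row order.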
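To tie this back to the question, here is a minimal sketch of the same assignment using the question's own data. It assumes missing_age_grps is a NumPy array holding [0, 1], as printed in the question; the array is built by hand here to stand in for the logistic regression output.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4],
    'Sex': ["Male", "Female", "Male", "Male", "Female"],
    'Age': [np.nan, 23, np.nan, 6, 15],
    'Age_group': [3, 2, 3, 0, 1],
})

# Assumption: stands in for the model predictions described in the question.
missing_age_grps = np.array([0, 1])

# True for the rows whose group is the "missing data" code.
mask = df1['Age_group'] == 3

# The k-th element of the array goes to the k-th row selected by the mask.
df1.loc[mask, 'Age_group'] = missing_age_grps

result = df1['Age_group'].tolist()
print(result)  # [0, 2, 1, 0, 1]
```

The printed list matches the Age_group column of the expected output in the question.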

DaveR
  • Sorry if this isn't obvious from the question, but the replacement value isn't always 2. It may be 0, 1, or 2. I will update the question to reflect this ... – hamslice Jun 15 '20 at 10:56
  • so use `replacement_value = [0,1]` – jezrael Jun 15 '20 at 10:58
  • I adjusted the answer as per your request @hamslice – DaveR Jun 15 '20 at 11:03
  • Actually I am not able to pass a list into this, I get the following error `ValueError: Must have equal len keys and value when setting with an iterable` – hamslice Jun 15 '20 at 11:05
  • check the updated code, probably you are doing this `df1.loc[df1['Age_group'] ==3, :] = replacement_array` instead of `df1.loc[df1['Age_group'] ==3, "Age_group"] = replacement_array` – DaveR Jun 15 '20 at 11:06
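The error in the comments generally points to a mismatch between the number of values supplied and the size of the selection being assigned to. The thread does not say which mismatch occurred here, so the following is only a sketch of one way to guard against it: count the matching rows first and fail with a clearer message. The helper name fill_missing_groups and its missing_code parameter are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

def fill_missing_groups(df, predictions, missing_code=3):
    """Hypothetical helper: return a copy of df with missing codes replaced."""
    mask = df['Age_group'] == missing_code
    n_missing = int(mask.sum())
    # Check the length up front so a mismatch gives a readable message.
    if len(predictions) != n_missing:
        raise ValueError(f"expected {n_missing} predictions, got {len(predictions)}")
    out = df.copy()
    # Select rows with the mask and name the single target column explicitly.
    out.loc[mask, 'Age_group'] = np.asarray(predictions)
    return out

df1 = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4],
    'Sex': ["Male", "Female", "Male", "Male", "Female"],
    'Age': [np.nan, 23, np.nan, 6, 15],
    'Age_group': [3, 2, 3, 0, 1],
})

filled = fill_missing_groups(df1, [0, 1])

# Three predictions for two missing rows trips the guard.
try:
    fill_missing_groups(df1, [0, 1, 2])
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(filled['Age_group'].tolist())
print(length_error)
```

Working on a copy leaves df1 untouched, which makes it easier to rerun the fill after retraining the model.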