I'm stuck on what seems like a fairly obvious task.
I have a DataFrame with missing data. To handle it, I want to compare two DataFrames built from it:
- X_real_zeros, where missing values are replaced with 0;
- X_real_means, where missing values are replaced with the column's mean.
I have collected the names of all numeric columns in one list:
numeric_cols = ['RFCD.Percentage.1', 'RFCD.Percentage.2', 'RFCD.Percentage.3',
                'RFCD.Percentage.4', 'RFCD.Percentage.5',
                'SEO.Percentage.1', 'SEO.Percentage.2', 'SEO.Percentage.3',
                'SEO.Percentage.4', 'SEO.Percentage.5',
                'Year.of.Birth.1', 'Number.of.Successful.Grant.1', 'Number.of.Unsuccessful.Grant.1']
Then I try to create the two DataFrames:
data = pd.read_csv('data.csv')
X_real_zeros = data
for col in numeric_cols:
    X_real_zeros[col] = data[col].fillna(0)

X_real_means = data
a = calculate_means(data[numeric_cols])
for col in numeric_cols:
    print(a[col], col)
    X_real_means[col] = data[col].fillna(a[col])
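But by the time I create the second one, it turns out that my data DataFrame has already been modified: the missing values were filled with 0 in the first loop, so nothing is left to replace with the means. In any case, I suspect my approach is not the right one. What is the proper way to solve this kind of task?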
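
For reference, here is a minimal reproduction on a toy frame. The column name "x", the values, and the variable names alias, X_zeros and X_means are made up for illustration and are not from my real data. It suggests that plain assignment only binds a second name to the same object, and that an explicit .copy() (or using the new frame that fillna returns) seems to avoid the problem, though I am not sure this is the idiomatic route:

```python
import numpy as np
import pandas as pd

# Toy stand-in for data.csv; column name and values are hypothetical.
data = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# Plain assignment: `alias` and `data` are the same object, nothing is copied.
alias = data
alias["x"] = data["x"].fillna(0)
aliased_same_object = alias is data
nans_left_after_alias = int(data["x"].isna().sum())  # the original lost its NaN

# Start again, this time working on explicit copies.
data = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

X_zeros = data.copy()
X_zeros["x"] = X_zeros["x"].fillna(0)

# fillna returns a new frame; a Series of means is matched to columns by name.
X_means = data.fillna(data.mean(numeric_only=True))

nans_left_in_original = int(data["x"].isna().sum())  # the original still has its NaN
```

With the copies, data keeps its missing value while X_zeros and X_means each get their own filled version.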