
I am trying to replace NaN values in a dataset with the column mean using sklearn.preprocessing.Imputer. Instead of being replaced, the NaN values are removed from the output. Here is a short example demonstrating the issue:

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> test_data = np.array([float("NaN"), 1, 2, 3])
>>> imp = Imputer(missing_values=float("NaN"), strategy="mean")
>>> imp.fit_transform(test_data)
** Deprecation warning truncated **
array([[1., 2., 3.]])

What should I change so that the NaN is replaced by 2. instead of being removed?

I tried to adapt the example from the sklearn.preprocessing.Imputer user guide, and was originally following this answer, but I must have misunderstood them.

Edit:

I have also tried the following, which gets rid of the deprecation warning but does not change the end result:

>>> test_data = np.array([[float("NaN"), 1, 2, 3]])
>>> imp = Imputer(missing_values=float("NaN"), strategy="mean")
>>> imp.fit_transform(test_data)
array([[1., 2., 3.]])
ApplePie

2 Answers


The Imputer expects two-dimensional input with one row per sample and one column per feature, which a data frame provides. This works as expected:

import pandas as pd
from sklearn.preprocessing import Imputer

test_series = pd.Series([float("NaN"), 1, 2, 3])
test_data_frame = pd.DataFrame({"test_series": test_series})
imp = Imputer(missing_values=float("NaN"), strategy="mean")
test_data_frame = imp.fit_transform(test_data_frame)
print(test_data_frame)
selotape
    Interestingly, this seems to work whether axis is set to 0 or 1 whereas my solution using a numpy array works exclusively by using either columns or rows, depending how the data was arranged. – ApplePie Jun 23 '17 at 18:51

I found the answer by re-reading the documentation for sklearn.preprocessing.Imputer: I was leaving out the axis parameter of the Imputer() constructor. By default it is 0, which applies the strategy to each column, but I was passing a single row of data. With axis=0, every column of that row holds a single value, and a column that contains only missing values is discarded on transform, which is why the NaN disappeared. Since my data is a row, I should have used axis=1.

This gives the result I originally expected:

>>> test_data = np.array([[float("NaN"), 1, 2, 3]])
>>> imp = Imputer(missing_values=float("NaN"), strategy="mean", axis=1)
>>> imp.fit_transform(test_data)
array([[2., 1., 2., 3.]])
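
To make the role of the axis concrete, here is a minimal NumPy-only sketch of mean imputation along rows and along columns. It does not use Imputer at all; the variable names (row, column, empty_columns) are illustrative, and it only mimics the idea of the "mean" strategy rather than reproducing the library's implementation.

```python
import numpy as np

# A single row of four values, the same layout passed to Imputer above.
row = np.array([[np.nan, 1.0, 2.0, 3.0]])

# Seen column by column, the first column has no observed value at all,
# so there is no column mean that could be used to fill it.
empty_columns = np.isnan(row).all(axis=0)

# Row-wise imputation (the axis=1 idea): fill each NaN with the mean of
# the non-NaN values in the same row.
row_means = np.nanmean(row, axis=1, keepdims=True)
filled = np.where(np.isnan(row), row_means, row)
print(filled)

# Column-wise imputation (the axis=0 idea) gives the same numbers once
# the data is arranged as one column with four samples.
column = row.T
column_means = np.nanmean(column, axis=0, keepdims=True)
filled_column = np.where(np.isnan(column), column_means, column)
print(filled_column.ravel())
```

Either arrangement fills the gap with 2.0; what matters is that the axis used for the mean matches the direction in which the samples are laid out.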
ApplePie
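
For readers on newer scikit-learn releases: Imputer was deprecated in version 0.20 and removed in 0.22, and its replacement, sklearn.impute.SimpleImputer, has no axis parameter and always imputes per column. The sketch below assumes scikit-learn 0.20 or later; reshaping the data into a single column with reshape(-1, 1) is an illustrative choice, not something taken from the posts above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Arrange the four values as one feature column with four samples,
# because SimpleImputer computes its statistics column by column.
test_data = np.array([np.nan, 1, 2, 3]).reshape(-1, 1)

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
result = imp.fit_transform(test_data)

# Flatten back to one dimension for display.
print(result.ravel())
```

If the data really is one sample with several features, row-wise means would have to be computed separately, since SimpleImputer offers no equivalent of axis=1.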