I tried the approach from "Use Scikit Learn to do linear regression on a time series pandas data frame", but couldn't get it to work for my data.

My data consists of two DataFrames: DataFrame_1.shape = (40, 5000) and DataFrame_2.shape = (40, 74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN (missing) values. When I call DataFrame_2.dropna(how="any"), the shape drops to (2, 74).

Is there any linear regression algorithm in sklearn that can handle NaN values?

I'm modeling it after load_boston from sklearn.datasets, where X, y = boston.data, boston.target have shapes (506, 13) and (506,).

Here's my simplified code:

from sklearn.linear_model import LinearRegression

X = DataFrame_1
for col in DataFrame_2.columns:
    y = DataFrame_2[col]
    model = LinearRegression()
    model.fit(X, y)

#ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I used the loop above so that the shapes of the matrices match up.

If posting the DataFrame_2 would help, please comment below and I'll add it.

O.rka

2 Answers

You can fill in the null values in y with imputation. In scikit-learn this is done with the following code snippet:

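from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
imputer = SimpleImputer(strategy="mean")
# fit_transform expects 2-D input, so pass the Series as a one-column DataFrame
y_imputed = imputer.fit_transform(y.to_frame()).ravel()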
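
For context, here is a rough sketch of how the imputation step could be combined with the loop from the question. It assumes a recent scikit-learn, where the imputer class is `SimpleImputer` in `sklearn.impute`. The random frames are small stand-ins for DataFrame_1 and DataFrame_2 (not the real shapes), the `x…`/`y…` column names are made up, and `fit_per_column` is a hypothetical helper name, not part of any library.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Stand-in data (assumption): 40 rows, far fewer columns than the real frames.
rng = np.random.default_rng(0)
DataFrame_1 = pd.DataFrame(rng.normal(size=(40, 10)),
                           columns=[f"x{i}" for i in range(10)])
DataFrame_2 = pd.DataFrame(rng.normal(size=(40, 5)),
                           columns=[f"y{i}" for i in range(5)])
# Blank out roughly 20% of the target values to mimic the missing data.
DataFrame_2 = DataFrame_2.mask(rng.random(DataFrame_2.shape) < 0.2)


def fit_per_column(X, Y):
    """Fit one LinearRegression per column of Y after mean-imputing that column."""
    models = {}
    for col in Y.columns:
        # SimpleImputer needs 2-D input, so select the column as a one-column frame.
        y_imputed = SimpleImputer(strategy="mean").fit_transform(Y[[col]]).ravel()
        models[col] = LinearRegression().fit(X, y_imputed)
    return models


models = fit_per_column(DataFrame_1, DataFrame_2)
```

Note that `SimpleImputer` drops a column that is entirely NaN, so a target column with no observed values would need to be skipped before this loop.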

Otherwise, you might want to build your model using a subset of the 74 columns as predictors. Perhaps some of your columns contain fewer null values?
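
One way to look for such columns is to count the missing values per column and keep only those below some cutoff. A rough sketch, where the toy frame and the `max_missing` threshold are assumptions chosen purely for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for DataFrame_2 (assumption): columns with different amounts of NaN.
DataFrame_2 = pd.DataFrame({
    "c0": [1.0, 2.0, 3.0, 4.0],
    "c1": [1.0, np.nan, 3.0, 4.0],
    "c2": [np.nan, np.nan, np.nan, 4.0],
})

# Number of missing values in each column.
nan_counts = DataFrame_2.isna().sum()

# Hypothetical cutoff: keep columns with at most one missing value.
max_missing = 1
usable = DataFrame_2.loc[:, nan_counts <= max_missing]

print(nan_counts.to_dict())
print(list(usable.columns))
```

What counts as "few enough" missing values depends on the data, so the cutoff here is only a placeholder.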

maxymoo
  • I tried this with my column and got `TypeError: unbound method fit_transform() must be called with Imputer instance as first argument (got Series instance instead)` then tried it with the whole DataFrame and got the same thing (w/ DataFrame instead of Series) – O.rka Oct 13 '15 at 23:21
  • with scikit you need to call things on the underlying numpy arrays, not the dataframes themselves; you should have already set `X=DataFrame_1.values` and `y=DataFrame_2.values` – maxymoo Oct 13 '15 at 23:32
  • oops, I also gave you the wrong syntax for the imputer; I've fixed the code – maxymoo Oct 13 '15 at 23:35
  • I saw somewhere `imp = Imputer(missing_values='NaN', strategy='mean', axis=0)`. How would this alter the default? – O.rka Oct 13 '15 at 23:49
  • those are the defaults, I guess it's just showing you that you can change them if you want – maxymoo Oct 14 '15 at 00:26
  • What I don't understand is why `y_imputed` returns a numpy array that differs in length from the column in the original data frame. The non-imputed numbers are included, so it's not a matter of showing "imputed only". What gives? – Thomas Matthew Nov 05 '15 at 07:16
  • @ThomasMatthew the shape of the arrays should definitely be the same; if you're still having this problem, post a question about it and someone should be able to help you. – maxymoo Nov 05 '15 at 22:49
  • `Imputer` is no longer available as of sklearn 0.23.2; use `sklearn.impute.SimpleImputer` instead – 5norre Sep 11 '20 at 10:59

If your variable is a DataFrame, you could use fillna. Here the missing data in each column is replaced with the mean of that column.

df.fillna(df.mean(), inplace=True)
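
To illustrate what this one-liner does, here is a small sketch on a made-up frame (the column names and values are assumptions for illustration only). `df.mean()` skips NaN by default, so each gap is filled with the mean of the observed values in its own column. The sketch uses the non-inplace form so the original frame stays available for comparison.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): two columns with one missing value each.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# df.mean() returns per-column means computed over the non-missing values;
# fillna aligns that Series with the columns by label.
filled = df.fillna(df.mean())

print(filled)
```

The gap in `a` becomes 2.0 (the mean of 1.0 and 3.0) and the gap in `b` becomes 6.0 (the mean of 4.0 and 8.0).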
Foreever