
I have a pandas Series s that contains my label, and a pandas DataFrame df that contains my data. I want to use scikit-learn's RandomForestRegressor to generate predictions of my label.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=1000, max_depth=30, random_state=31415)
model.fit(df, s)

But when I do so, the .fit() call throws the following exception:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I don't really understand why I get this error. My label and all the columns of my DataFrame are numerical:

print(s.describe())

count      1168.000000
mean     181193.444349
std       81756.636708
min       34900.000000
25%      129000.000000
50%      160000.000000
75%      214600.000000
max      755000.000000
Name: Label, dtype: float64

print(df.describe())

            Field1       Field2       Field3       Field4       Field5       Field6       Field7        Field8
count  1168.000000  1168.000000  1168.000000  1168.000000  1168.000000  1168.000000  1168.000000   1168.000000 
mean      6.080479  1519.982877     1.749144  1057.800514     0.973459     1.556507  1970.724315   1984.442637 
std       1.392363   540.953069     0.760811   444.809832     0.160807     0.554077    29.939059     20.626356 
min       1.000000   334.000000     0.000000     0.000000     0.000000     0.000000  1872.000000   1950.000000 
25%       5.000000  1123.750000     1.000000   795.750000     1.000000     1.000000  1953.750000   1966.000000 
50%       6.000000  1465.000000     2.000000   990.000000     1.000000     2.000000  1972.000000   1993.000000 
75%       7.000000  1786.000000     2.000000  1291.500000     1.000000     2.000000  2000.000000   2003.000000 
max      10.000000  5642.000000     4.000000  6110.000000     1.000000     3.000000  2010.000000   2010.000000 

I also have no null values in either s or df:

print(np.isnan(s).unique())

[False]


print(df.isnull().sum().sort_values(ascending=False))

Field8     0
Field7     0
Field6     0
Field5     0
Field4     0
Field3     0
Field2     0
Field1     0
dtype: int64

I even manually checked my data and didn't see any weird values.
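One thing worth noting: the error message also mentions infinity, and df.isnull() does not flag np.inf. A hedged sketch (with made-up toy data, not the actual df) of a check that catches both NaN and inf:

```python
import numpy as np
import pandas as pd

# toy data standing in for df; the inf is deliberate
df = pd.DataFrame({"a": [1.0, np.inf, 3.0], "b": [4.0, 5.0, 6.0]})

print(df.isnull().sum().sum())   # 0 -- isnull() does not flag inf
finite = np.isfinite(df)         # element-wise: False for both NaN and inf
print(df[~finite.all(axis=1)])   # rows containing NaN or inf
```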

What could be causing this error?

EDIT:

After trying multiple things, I found a solution (even if I don't quite understand why it solves my problem).

In my case, adding

df = df.reset_index(drop=True)

before the .fit() call solved the problem (as proposed here). If someone understands what's going on here, I'm interested.
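One possible (unconfirmed) mechanism: pandas aligns on index labels, not positions, when assigning a Series to a DataFrame column, so a mismatched index can silently introduce NaN that only appears once the frames are combined. A minimal sketch with hypothetical data:

```python
import pandas as pd

# df keeps its original labels after, say, filtering rows
df = pd.DataFrame({"Field1": [1.0, 2.0, 3.0]}, index=[10, 11, 12])
extra = pd.Series([7.0, 8.0, 9.0], index=[0, 1, 2])  # built elsewhere, index 0..2

# column assignment aligns on index labels -> nothing matches
df["Field2"] = extra
print(df["Field2"])  # all NaN

# resetting the index beforehand makes labels and positions coincide
df2 = df[["Field1"]].reset_index(drop=True)
df2["Field2"] = extra
print(df2["Field2"])  # 7.0, 8.0, 9.0
```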

Nakeuh
  • do check this link https://datascience.stackexchange.com/questions/11928/valueerror-input-contains-nan-infinity-or-a-value-too-large-for-dtypefloat32 – Akhilesh_IN Apr 02 '19 at 09:20
  • @Nakeuh you should post an answer to your own question (and accept it) with the solution you found – CAPSLOCK Apr 02 '19 at 09:38

2 Answers


It could be due to the huge difference in scale between your features (e.g. Field1 ranges from ~1 to ~10 while Field2 ranges from ~300 to ~5000).

Try applying feature scaling and then fit the model:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df = scaler.fit_transform(df)


It sounds like there are missing values. It could also be that some data points are not recognized as NaN by the isnan/isnull functions because they contain spaces or similar characters, which ML models cannot accept; the values must be purely numerical.

Please check the data types of the DataFrame columns using the following line of code:

df.dtypes
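To hunt down values of that kind, one option (a sketch with hypothetical data, not the asker's actual columns) is to coerce everything to numeric and inspect what fails to parse:

```python
import pandas as pd

# hypothetical column where a stray space hides among number-like strings
raw = pd.DataFrame({"Field1": ["1.5", " ", "3.0"]})
print(raw.dtypes)  # object, not float64 -- a first hint

# anything unparseable becomes NaN under errors="coerce"
coerced = raw.apply(pd.to_numeric, errors="coerce")
print(raw[coerced.isna().any(axis=1)])  # the offending rows
```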

Please also share the shape of both the feature DataFrame (df) and the target Series (s).

Rami Azmi