I have a pandas Series s that contains my label and a pandas DataFrame df that contains my data. I want to use scikit-learn's RandomForestRegressor to generate predictions of my label:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=1000, max_depth=30, random_state=31415)
model.fit(df, s)
But when I do so, the .fit() call throws the following exception:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I don't really understand why I get this error. My label and all the columns of my DataFrame are numerical:
print(s.describe())
count 1168.000000
mean 181193.444349
std 81756.636708
min 34900.000000
25% 129000.000000
50% 160000.000000
75% 214600.000000
max 755000.000000
Name: Label, dtype: float64
print(df.describe())
Field1 Field2 Field3 Field4 Field5 Field6 Field7 Field8
count 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000
mean 6.080479 1519.982877 1.749144 1057.800514 0.973459 1.556507 1970.724315 1984.442637
std 1.392363 540.953069 0.760811 444.809832 0.160807 0.554077 29.939059 20.626356
min 1.000000 334.000000 0.000000 0.000000 0.000000 0.000000 1872.000000 1950.000000
25% 5.000000 1123.750000 1.000000 795.750000 1.000000 1.000000 1953.750000 1966.000000
50% 6.000000 1465.000000 2.000000 990.000000 1.000000 2.000000 1972.000000 1993.000000
75% 7.000000 1786.000000 2.000000 1291.500000 1.000000 2.000000 2000.000000 2003.000000
max 10.000000 5642.000000 4.000000 6110.000000 1.000000 3.000000 2010.000000 2010.000000
I also have no null values in either s or df:
print(np.isnan(s).unique())
[False]
print(df.isnull().sum().sort_values(ascending=False))
Field8 0
Field7 0
Field6    0
Field5 0
Field4 0
Field3 0
Field2 0
Field1 0
dtype: int64
I even manually checked my data and didn't see any weird values.
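For completeness, the error message also mentions infinity and values too large for float32, which the isnull() check above does not cover. A quick sanity check along these lines should rule those out as well (a minimal sketch, assuming every column of df is numeric):

import numpy as np

# Check the other two conditions named in the ValueError:
print(np.isinf(df).any().any())                           # any +/- infinity?
print((df.abs() > np.finfo(np.float32).max).any().any())  # any value too large for float32?
print(df.dtypes)                                          # confirm every column is numeric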
What could be causing this error?
EDIT:
After trying multiple things, I found a solution (even if I don't quite understand why it solves my problem).
In my case, adding
df = df.reset_index(drop=True)
before the .fit() call solved the problem (as proposed here). Note that reset_index returns a new DataFrame rather than modifying df in place, so the result has to be assigned back.
If someone understands what's going on here, I would be interested to know.
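For reference, here is the workaround in context, together with a couple of index checks (resetting s as well, and the is_unique/equals inspection, are my own additions rather than part of the original fix):

# Inspect the index first: a non-unique index, or an index that no longer lines up
# with the one on s, is exactly the kind of thing reset_index(drop=True) wipes out.
print(df.index.is_unique, s.index.is_unique)
print(df.index.equals(s.index))

df = df.reset_index(drop=True)
s = s.reset_index(drop=True)

model.fit(df, s)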