Expected and predicted arrays end up the same in scikit-learn random forest model

Question

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# df_train (a pandas DataFrame) and train_vars (a list of column names) are not shown here
data = df_train[train_vars].to_numpy()  # All columns aside from 'output'
target = df_train['output'].to_numpy()

# Get training and testing splits
splits = train_test_split(data, target, test_size=0.2)
data_train, data_test, target_train, target_test = splits

# Fit the training data to the model (100 trees)
model = RandomForestRegressor(n_estimators=100)
model.fit(data_train, target_train)

# Make predictions
expected = target_test
predicted = model.predict(data_test)

When I run this code to predict the variable 'output' as a function of all other variables in this file (https://www.dropbox.com/s/cgyh09q2liew85z/uuu.csv?dl=0), the expected and predicted arrays are exactly the same. It seems like I am overfitting or doing something wrong. How can I fix it?

Depends on complexity of data. Can you run same experiment but use 0.5 for train and 0.5 for test? — Farseer, Jan 25 '16 at 14:26

score 1 · Accepted Answer · edited May 23 '17 at 11:44

Kudos for questioning too good results!

Each feature (column) in the data contains only a small number of distinct values. If I counted correctly, there are only 14 unique rows.

This has two implications:

1. You are very likely to be overfitting, because you only have 14 effective samples but 36 features.
2. The same rows are very likely to appear in both the testing set and the training set. This means you are testing on the same data that the model was trained on. Since the model is perfectly overfitted to this data, you get perfect results.
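Both implications can be checked directly. Below is a minimal sketch, assuming the rows can be iterated one by one (plain lists or the rows of a NumPy array). The helper names count_unique_rows and leaked_fraction are hypothetical, and the toy rows are made up for illustration, not taken from the asker's file.

```python
def count_unique_rows(rows):
    """Return the number of distinct rows (each row is hashed as a tuple)."""
    return len({tuple(row) for row in rows})


def leaked_fraction(train_rows, test_rows):
    """Return the fraction of test rows that also occur in the training rows."""
    seen = {tuple(row) for row in train_rows}
    hits = sum(1 for row in test_rows if tuple(row) in seen)
    return hits / len(test_rows)


# Toy data, made up for illustration: 6 rows, but only 3 distinct ones.
rows = [[0, 1], [0, 1], [2, 3], [2, 3], [4, 5], [4, 5]]
train_rows, test_rows = rows[::2], rows[1::2]

n_unique = count_unique_rows(rows)
leak = leaked_fraction(train_rows, test_rows)
print(n_unique, leak)
```

Applied to the question's variables, count_unique_rows(data) would give the number of effective samples, and leaked_fraction(data_train, data_test) close to 1 would mean nearly every test row was already seen during training.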

Edit

I just realized I haven't answered the actual question - How to fix it?

That depends.

If you are lucky, someone made an error in preparing the data.

If the data is correct, things will be more difficult. First, get rid of duplicate rows, for example with np.vstack({tuple(row) for row in data}) (see here). Then see whether you can do any meaningful work with what remains. To be honest, though, I believe 14 samples is too few for machine learning. Try to get more data :)
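One caveat with deduplicating data on its own is that target would no longer line up with it row for row. The sketch below removes repeated (features, target) pairs together instead. It is a plain-Python variant of the set-of-tuples idea, not the answer's exact snippet; drop_duplicate_samples is a hypothetical helper, and the toy values are made up. For NumPy arrays, np.unique(..., axis=0) on the stacked features and target should be a workable alternative, and recent NumPy releases may refuse a set passed to np.vstack.

```python
def drop_duplicate_samples(data, target):
    """Drop repeated (features, target) pairs, keeping first occurrences in order."""
    # dict.fromkeys keeps insertion order and discards repeated keys.
    unique_pairs = dict.fromkeys(
        (tuple(row), t) for row, t in zip(data, target)
    )
    unique_data = [list(row) for row, _ in unique_pairs]
    unique_target = [t for _, t in unique_pairs]
    return unique_data, unique_target


# Toy data, made up for illustration.
data = [[0, 1], [2, 3], [0, 1], [2, 3], [4, 5]]
target = [1.0, 2.0, 1.0, 2.0, 3.0]

unique_data, unique_target = drop_duplicate_samples(data, target)
print(unique_data, unique_target)
```

The deduplication has to happen before the train/test split. Otherwise copies of one row can still land on both sides of it.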

thanks @kazemakase, you are right, there is a bug in my code causing the low number of unique rows — user308827, Jan 25 '16 at 15:57
Ah, I just edited my answer. Looks like you drew the lucky option, then :) — MB-F, Jan 25 '16 at 16:00
