I am trying to use sklearn and ran into an error, but I have no idea what is wrong. This is my code:

import pandas as pdd
from sklearn.tree import DecisionTreeClassifier
df = pd.read_csv('vgsales.csv')
X = df.drop(columns=['Name','Platform','Publisher','Genre'])#input
y = df['Rank']#output
model = DecisionTreeClassifier()
model.fit(X, y)
predictions = model.predict([16598],[])
predictions

This is my error:

ValueError                                Traceback (most recent call last)
<ipython-input-28-152586bc1b23> in <module>()
      8 df = df.reset_index()
      9 model = DecisionTreeClassifier()
---> 10 model.fit(X, y)
     11 predictions = model.predict([16598],[])
     12 predictions

/home/frankie/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.pyc in fit(self, X, y, 
sample_weight, check_input, X_idx_sorted)
    788             sample_weight=sample_weight,
    789             check_input=check_input,
--> 790             X_idx_sorted=X_idx_sorted)
    791         return self
    792 

/home/frankie/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.pyc in fit(self, X, y, 
sample_weight, check_input, X_idx_sorted)
    114         random_state = check_random_state(self.random_state)
    115         if check_input:
--> 116             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    117             y = check_array(y, ensure_2d=False, dtype=None)
    118             if issparse(X):

/home/frankie/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.pyc in 
check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, 
ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    451                              % (array.ndim, estimator_name))
    452         if force_all_finite:
--> 453             _assert_all_finite(array)
    454 
    455     shape_repr = _shape_repr(array.shape)

/home/frankie/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.pyc in 
_assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45 
     46 

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Any help would be greatly appreciated.

1 Answer

Errors in your code

When you define X and y for training, the matrix X still contains the Rank column. You should drop it too. Otherwise your decision tree will be "silly", because you are giving it the output as an input. That's a serious mistake. To fix it:

X = df.drop(columns=['Name','Platform','Publisher','Genre', 'Rank'])#input

You have another error in the call to predict. It takes a single argument, so model.predict([16598],[]) is not valid. To predict the Rank for some input, you have to pass one or more samples in the same format as X. For example, to get predictions for every row of X:

predictions = model.predict(X)

You will get one prediction per row of X. To get a prediction for one specific row, you have to build that row yourself, with the same columns as X.
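
A minimal sketch of asking for one concrete prediction. It assumes a tiny made-up DataFrame in place of vgsales.csv, so the column values here are illustrative and not taken from the real file. The point is that predict takes a single 2D argument with the same columns as X, which means one sample is a one-row DataFrame:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for the real data (hypothetical values).
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Year": [2006, 1985, 2008, 2009],
    "Global_Sales": [82.7, 40.2, 35.8, 33.0],
})
X = df.drop(columns=["Rank"])  # input
y = df["Rank"]                 # output

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# One sample = one row with the same columns as X.
# The double brackets keep the result a DataFrame instead of a Series.
single_row = X.iloc[[0]]
single_prediction = model.predict(single_row)
print(single_prediction)
```

Predicting a training row like this only shows the call shape; it says nothing about how well the model generalizes.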

I also recommend using sklearn.model_selection.train_test_split, so that you evaluate the model on rows it was not trained on. See the scikit-learn documentation for more info.
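
As a rough sketch of how train_test_split could fit in, assuming a small made-up DataFrame (the column names feature_a, feature_b and label are hypothetical, not from vgsales.csv):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up data (hypothetical); replace with your own cleaned DataFrame.
df = pd.DataFrame({
    "feature_a": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "feature_b": [5, 3, 8, 1, 9, 2, 7, 4, 6, 0],
    "label":     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
X = df.drop(columns=["label"])
y = df["label"]

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Accuracy on rows the model has not seen during training.
accuracy = model.score(X_test, y_test)
print(len(X_train), len(X_test), accuracy)
```

Fixing random_state just makes the split reproducible; the 20% test size is an arbitrary choice for the example.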

Also check your pandas import (you imported it as pdd, but the rest of the code uses pd):

import pandas as pd

Explaining the error you asked about

As you can see in your traceback, the error is raised at line 10:

---> 10 model.fit(X, y)

Combining this with the last line of the error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

We know that the problem is in the call to fit. The function is complaining because your DataFrame contains empty values (NaN), infinite values, or values too large for float32.


Checking

To solve this, I first recommend checking whether you have NaN values:

df.isnull().any().any() 

This expression returns True if there are any NaN values in your DataFrame, and False otherwise. See the pandas documentation for more information.

You will probably get True, because your data has NaN values.
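
If the check returns True, it can help to see which columns are affected. A small sketch, assuming a made-up DataFrame with illustrative values. Note that isnull does not flag infinite values by default, so those are counted separately with NumPy:

```python
import numpy as np
import pandas as pd

# Made-up data with one NaN and one infinite value (hypothetical).
df = pd.DataFrame({
    "Year": [2006.0, np.nan, 2008.0],
    "Global_Sales": [82.7, 40.2, np.inf],
})

# Number of NaN values in each column.
nan_per_column = df.isnull().sum()

# Number of infinite values in each numeric column.
numeric = df.select_dtypes(include="number")
inf_per_column = np.isinf(numeric).sum()

print(nan_per_column)
print(inf_per_column)
```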


Solving

It's clear that we have to drop or replace the NaN values, because fit does not work with these values.

Drop: if you have just a few NaN values, I strongly recommend dropping all of those rows. Note that dropna returns a new DataFrame, so assign the result:

df = df.dropna()

Change: another solution is to replace the NaN values with 0. This will also solve the problem, but be aware that you are modifying your data with this step:

df = df.fillna(0)

There are also other options for substituting NaN values, for example the mean of the column or its most frequent value.
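
A sketch of those two alternatives, assuming a made-up DataFrame (the values are illustrative): the numeric column is filled with its mean, and the text column with its most frequent value.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values (hypothetical).
df = pd.DataFrame({
    "Year": [2000.0, np.nan, 2010.0],
    "Genre": ["Sports", None, "Sports"],
})

# Numeric column: replace NaN with the column mean (NaN is skipped by mean()).
df["Year"] = df["Year"].fillna(df["Year"].mean())

# Text column: replace missing entries with the most frequent value.
df["Genre"] = df["Genre"].fillna(df["Genre"].mode()[0])

print(df)
```

Which strategy is appropriate depends on the column; filling a year with a mean, for instance, may or may not make sense for your data.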


The final code should be something like this:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
df = pd.read_csv('vgsales.csv')
df = df.dropna()#remove rows with NaN before fitting
X = df.drop(columns=['Name','Platform','Publisher','Genre', 'Rank'])#input
y = df['Rank']#output
model = DecisionTreeClassifier()
model.fit(X, y)
predictions = model.predict(X)
predictions

I hope this helps you to solve your problem! :)

Alex Serra Marrugat