
I was learning the MLP regressor (`MLPRegressor`) on Google Colaboratory and ran the source code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()

# `table` was loaded earlier from the Kaggle CSV, e.g. table = pd.read_csv('googleplaystore.csv')
data = np.array(table)

scaler.fit(data)
y_index = data.shape[1]-1
sd_x = (scaler.var_[:y_index])**0.5
sd_y = (scaler.var_[y_index])**0.5
mean_x = scaler.mean_[:y_index]
mean_y = scaler.mean_[y_index]


x = (data[:, :y_index]).astype(np.float32)
y = (data[:, y_index]).astype(np.float32)

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25)
print('Separate training and testing sets!')

It gave the error `ValueError: could not convert string to float: 'Photo Editor & Candy Camera & Grid & ScrapBook'`.

So I checked the questions RandomForestClassfier.fit(): ValueError: could not convert string to float and sklearn-LinearRegression: could not convert string to float: '--'.

I changed `fit(data)` to `fit_transform(data)`, but the same error persisted. Then I switched from `StandardScaler` to `LabelEncoder`, changing `scaler = StandardScaler()` to `scaler = LabelEncoder()`. But a different error appeared: `ValueError: bad input shape (10841, 13)` on the line `scaler.fit_transform(data)`.

You can check the CSV from Kaggle here. The CSV contains both strings and numbers without quotation marks (except the prices, which are wrapped in double quotation marks).

Oo'-

3 Answers


From the documentation of sklearn's LabelEncoder: "This transformer should be used to encode target values, i.e. y, and not the input X."

In particular, a LabelEncoder is not intended to be fitted on the full dataset.

If you just want to replace the values of the categorical (i.e., string-valued) columns with unique numeric ids, one way to go is to apply the label encoder (before splitting the data) on each column you want to encode individually. As your sample code imports pandas, I assume that your data has been loaded into a `pandas.DataFrame`, like

df = pd.read_csv('/path/to/googleplaystore.csv')

From there, you can apply the encoder on each column:

from sklearn.preprocessing import LabelEncoder
df['App'] = LabelEncoder().fit_transform(df['App'].values)

You may also want to have a look how to handle categorical data within pandas.

However, even after doing this for each non-numeric column in your dataset, there is still a long way before fitting a model on the encoded data (you may want to apply one-hot encoding onto these columns afterwards, but this heavily depends on the model you want to use).
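The per-column approach above can be applied to every string-valued column in one loop. A minimal sketch (the column names and values here are only illustrative stand-ins for the Play Store data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the Play Store data (column names are illustrative)
df = pd.DataFrame({
    'App': ['Photo Editor', 'Candy Camera', 'Photo Editor'],
    'Category': ['ART_AND_DESIGN', 'PHOTOGRAPHY', 'ART_AND_DESIGN'],
    'Rating': [4.1, 3.9, 4.1],
})

# Encode each object-dtype (string) column with its own LabelEncoder
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col].values)

print(df.dtypes)  # all columns are numeric now
```

Note that each column gets its own encoder, so ids are only unique within a column, not across columns.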

Tobias Windisch
  • Do you mean `scaler['App'] = LabelEncoder().fit_transform(scaler['App'].values)`? – Oo'- Feb 21 '21 at 21:35
  • @GustavoReis: No, the `df` object is the `pandas.DataFrame` into which your data is loaded, like `df = pd.read_csv('/path/to/googleplaystore.csv')`. My suggestion was to apply the `LabelEncoder` to each column that contains string values subsequently. – Tobias Windisch Feb 21 '21 at 21:40
  • It still does not work, because the console reported `TypeError: float() argument must be a string or a number, not 'module'`. I made a small sample dataset and the error persisted. Maybe you can check the very small code at Gist and see if my code is incorrect: [here](https://gist.github.com/gusbemacbe/fe9499a207979961c8e17ced33af013f) – Oo'- Feb 21 '21 at 22:27
  • @GustavoReis: I cannot reproduce the exception. Although there are some places that could be improved, your Gist runs fine on my machine. It's unclear to me what exactly you want to achieve with that piece of code (for instance, why you split along `y_indices`). Please consider updating your question to receive more specific help. – Tobias Windisch Feb 22 '21 at 05:43
  • Thank you, @tobias-windisch. I will remove `StandardScaler` and test it. If it does not work, then it is my teacher's fault. – Oo'- Feb 22 '21 at 05:48

StandardScaler is a preprocessing class from sklearn that takes numeric entries and transforms them to zero mean and unit variance (an approximately standard Gaussian shape). It doesn't deal with text data, which explains the first error.
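On purely numeric input, StandardScaler behaves as described. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = StandardScaler().fit_transform(X)

# Each column now has mean 0 and unit (population) standard deviation
print(scaled.mean(axis=0))  # ~[0. 0.]
print(scaled.std(axis=0))   # ~[1. 1.]
```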

LabelEncoder is another preprocessing class from sklearn that maps categories to a numeric encoded representation. Ex: `["apple", "banana", "apple", "banana"]` to `[0, 1, 0, 1]`.
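The same example, run directly:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["apple", "banana", "apple", "banana"])
print(encoded)      # [0 1 0 1]
print(le.classes_)  # ['apple' 'banana'] -- labels are sorted before encoding
```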

Your dataset has missing values; you should deal with them first, by imputing, dropping, or a similar approach.

Then you should convert the types of each column (all but Rating are read as object, i.e. strings) so that each datatype is handled properly.

table = pd.read_csv('googleplaystore.csv')
# check dataset info
table.info()
# check missing values
table.isna().sum()
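A sketch of both steps on a toy frame (the column names and the `'--'` placeholder are assumptions; adapt them to what `table.info()` actually shows for your CSV):

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Rating': [4.1, np.nan, 3.7],
    'Reviews': ['100', '250', '--'],  # numbers stored as strings
})

# Turn placeholder strings into NaN and convert the column to a numeric dtype
table['Reviews'] = pd.to_numeric(table['Reviews'], errors='coerce')

# Drop rows that still contain missing values (imputing is an alternative)
table = table.dropna()
print(table.dtypes)
```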
viniciusrf1992
  • I made my own simple CSV database with 3 rows, and it gave the same error. I tested with another of Google Colaboratory's simple databases containing only numbers, and it worked. I understood that the source code isn't compatible with databases containing strings, but I do not know how to convert them. I am just learning. – Oo'- Feb 13 '21 at 21:02

To be honest, I think this is more of a conceptual problem than a technical one. As other users told you, StandardScaler must be used on numeric columns, but most of your dataframe columns are object type. You should probably use OneHotEncoder on them; all transformers in sklearn have a similar behaviour.

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit_transform(X)  # your data without target column
# ...blabla...

Finally, I recommend reading about Pipelines in sklearn; I think they are more elegant than a lot of messy code. You can put the preprocessing and model steps in the same pipeline, for example here.
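A minimal sketch of such a pipeline, combining one-hot encoding for a categorical column with scaling for a numeric one (the column names, toy data, and the choice of `MLPRegressor` are assumptions, not taken from your code):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the features and target
X = pd.DataFrame({
    'Category': ['ART', 'GAME', 'ART', 'GAME'],
    'Reviews': [100.0, 250.0, 80.0, 300.0],
})
y = [4.1, 3.9, 4.3, 4.0]

# Route each column to the right preprocessing step
pre = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Category']),
    ('num', StandardScaler(), ['Reviews']),
])

# Preprocessing and model live in one pipeline
model = Pipeline([
    ('preprocess', pre),
    ('regress', MLPRegressor(max_iter=2000, random_state=0)),
])
model.fit(X, y)
print(model.predict(X))
```

With this setup, `fit` and `predict` take the raw dataframe, and the encoding is applied consistently to training and test data.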

  • I use Jupyter Notebook, but unfortunately Jupyter told the error: `AttributeError: 'OneHotEncoder' object has no attribute 'var_'` Maybe you should check [here](https://gist.github.com/gusbemacbe/fe9499a207979961c8e17ced33af013f) and see my code. You do need to reproduce the exception. – Oo'- Feb 23 '21 at 06:42