I was learning the MLP regressor in Google Colaboratory and ran the following code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = np.array(table)
scaler.fit(data)
y_index = data.shape[1]-1
sd_x = (scaler.var_[:y_index])**0.5
sd_y = (scaler.var_[y_index])**0.5
mean_x = scaler.mean_[:y_index]
mean_y = scaler.mean_[y_index]
x = (data[:, :y_index]).astype(np.float32)
y = (data[:, y_index]).astype(np.float32)
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25)
print('Separate training and testing sets!')
It gave the error ValueError: could not convert string to float: 'Photo Editor & Candy Camera & Grid & ScrapBook'.
So I checked the question "RandomForestClassfier.fit(): ValueError: could not convert string to float". I also tried the question "sklearn-LinearRegression: could not convert string to float: '--'".
I changed fit(data) to fit_transform(data), but the same error persisted. Then I switched from StandardScaler to LabelEncoder, changing scaler = StandardScaler() to scaler = LabelEncoder(). That produced a different error on the line scaler.fit_transform(data): ValueError: bad input shape (10841, 13).
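In case it helps, here is a minimal sketch that should reproduce both errors without the Kaggle file. The small table below is made up for illustration (its column names and values are assumptions, not the real data), and the exact wording of the LabelEncoder message may differ between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical stand-in for the Kaggle table: one text column, two numeric ones.
table = pd.DataFrame({
    "App": ["Photo Editor", "Coloring book"],
    "Rating": [4.1, 3.9],
    "Reviews": [159, 967],
})
data = np.array(table)  # becomes an object array because of the text column

errors = {}

# StandardScaler tries to cast every cell to float, so the text column fails.
try:
    StandardScaler().fit(data)
except ValueError as exc:
    errors["StandardScaler"] = str(exc)

# LabelEncoder only accepts a 1-D array, so the whole 2-D table is rejected.
try:
    LabelEncoder().fit_transform(data)
except ValueError as exc:
    errors["LabelEncoder"] = str(exc)

for name, message in errors.items():
    print(f"{name}: {message}")
```

Both calls fail on the same mixed-type array, which is what I see with the full CSV as well.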
The CSV is from Kaggle; you can check it here. It contains both strings and numbers without quotation marks (except the prices, which are wrapped in double quotation marks).
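To show what I mean by mixed strings and numbers, here is a small sketch that lists which columns pandas does not parse as numeric. The inline sample is hypothetical (made-up rows and column names, not the actual Kaggle file), so it only illustrates the kind of check, not the real column list.

```python
import io

import pandas as pd

# Hypothetical sample: unquoted text and numbers, with quoted price values.
sample_csv = io.StringIO(
    'App,Rating,Price\n'
    'Photo Editor,4.1,"0"\n'
    'Coloring book,3.9,"$4.99"\n'
)
table = pd.read_csv(sample_csv)

# Columns that are not numeric are the ones a scaler cannot convert to float.
text_columns = table.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

In this sample the price column ends up as text because of the "$" sign, even though it looks numeric at first glance.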