I am new to machine learning and am building my first model on my own. I have a dataset that evaluates cars: it contains features describing price, safety, and luxury, and classifies each car as unacceptable, acceptable, good, or very good. I converted all the non-numeric columns to numeric, trained a model on the training split, and predicted on a test set. However, my predictions are awful: I used LinearRegression, and r2_score returns 0.05, which is practically 0. I have tried a few other models, and all of them give similarly poor predictions and accuracy.
What am I doing wrong? I have followed tutorials and read articles that use a similar methodology, yet they end up with 0.92 accuracy while I get 0.05. How do you build a good model for your data, and how do you know which model to use?
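One thing that may matter when comparing those two numbers: the tutorials report accuracy, while my 0.05 is an R² value, and the two are not necessarily on the same scale. The sketch below uses made-up integer labels (purely illustrative, not taken from my dataset) to show accuracy_score and r2_score computed on the same predictions.

```python
from sklearn.metrics import accuracy_score, r2_score

# Made-up integer class labels, for illustration only (not from the car dataset)
y_true = [0, 1, 2, 3, 2, 2, 0, 2]
y_pred = [0, 1, 2, 3, 2, 0, 2, 2]

# Accuracy: fraction of labels predicted exactly right
acc = accuracy_score(y_true, y_pred)

# R^2 treats the labels as numbers and compares the squared errors
# against always predicting the mean of y_true
r2 = r2_score(y_true, y_pred)

print(f"accuracy = {acc:.2f}, R^2 = {r2:.2f}")
```

In this toy case six of the eight labels match, so accuracy is 0.75, yet R² comes out at 0 because the two misses are each off by two classes. A low R² on class labels therefore does not by itself say how many labels were predicted correctly.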
Code:
import numpy as np
import pandas as pd
from sklearn import preprocessing, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class value']
df = pd.read_csv('car.data.txt', index_col=False, names=columns)
# Try to parse each column as integers; label-encode the ones that cannot be parsed
for col in df.columns:
    try:
        df[col] = df[col].astype(int)
    except ValueError:
        enc = preprocessing.LabelEncoder()
        df[col] = enc.fit_transform(df[col])
#Split the data
class_y = df.pop('class value')
x_train, x_test, y_train, y_test = train_test_split(df, class_y, test_size=0.2, random_state=0)
#Make the model
regression_model = linear_model.LinearRegression()
regression_model = regression_model.fit(x_train, y_train)
#Predict the test data
y_pred = regression_model.predict(x_test)
score = r2_score(y_test, y_pred)
print(score)
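To check what the encoding step actually produces, here is a minimal sketch that runs LabelEncoder on the four class labels alone. It assumes the labels are spelled unacc, acc, good, and vgood (an assumption; the spelling in car.data.txt may differ). LabelEncoder assigns integer codes in sorted order of the labels, so the codes need not follow the quality ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed spellings of the four classes, listed from worst to best
labels_worst_to_best = ['unacc', 'acc', 'good', 'vgood']

enc = LabelEncoder()
codes = enc.fit_transform(labels_worst_to_best)

# enc.classes_ holds the labels in sorted order; a label's position there is its code
for label, code in zip(labels_worst_to_best, codes):
    print(label, '->', code)
# unacc -> 2
# acc -> 0
# good -> 1
# vgood -> 3
```

Since LinearRegression treats these codes as magnitudes, the arbitrary ordering may be part of why the fit is so poor. The same would apply to the encoded feature columns.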