
I am new to machine learning and am building my first model on my own. I have a dataset that evaluates cars: it contains features describing price, safety and luxury, and classifies each car as good, very good, acceptable or unacceptable. I converted all the non-numeric columns to numeric, trained a model and predicted on a test set. However, my predictions are awful: I used LinearRegression and r2_score outputs 0.05, which is practically 0. I have tried a few different models, and all of them give horrible predictions and accuracy.

What am I doing wrong? I have seen tutorials, read articles with similar methodology, yet they end up with 0.92 accuracy and I'm getting 0.05. How do you make a good model for your data and how do you know which model to use?

Code:

import numpy as np
import pandas as pd
from sklearn import preprocessing, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500) 
pd.set_option('display.width', 1000)

columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class value']
df = pd.read_csv('car.data.txt', index_col=False, names=columns)

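for col in df.columns.values:
    try:
        # Keep the column as integers if every value parses as one
        df[col] = df[col].astype(int)
    except ValueError:
        # Otherwise map each category string to an integer code
        enc = preprocessing.LabelEncoder()
        df[col] = enc.fit_transform(df[col])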

#Split the data
class_y = df.pop('class value')
x_train, x_test, y_train, y_test = train_test_split(df, class_y, test_size=0.2, random_state=0)

#Make the model
regression_model = linear_model.LinearRegression()
regression_model = regression_model.fit(x_train, y_train)

#Predict the test data
y_pred  = regression_model.predict(x_test)

score = r2_score(y_test, y_pred)
  • As already answered, you are using an inappropriate model (regression) for your problem (classification); check the scikit-learn docs for the available classification models (notice that logistic regression, despite its name, is a classification model, as suggested in the answer below). My answer [here](https://stackoverflow.com/questions/38015181/accuracy-score-valueerror-cant-handle-mix-of-binary-and-continuous-target/54458777#54458777) might be generally helpful (PS please accept the answer below, as it is essentially the correct one) – desertnaut Mar 12 '19 at 19:48

1 Answer


You should not use Linear Regression here: it predicts continuous values, not categorical ones. What you are trying to predict is categorical: each possible evaluation (unacceptable, acceptable, good, very good) is a class.
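As a small illustration of why the encoded target is not a meaningful number to regress on, here is a sketch of what `LabelEncoder` does to the class column. The label spellings (`unacc`, `acc`, `good`, `vgood`) are an assumption about how the file writes them; substitute whatever your data actually contains.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label spellings, listed from worst to best evaluation
labels = ['unacc', 'acc', 'good', 'vgood']

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# LabelEncoder assigns integer codes in sorted (alphabetical) order,
# not in order of quality, so the codes are identifiers, not a scale.
mapping = dict(zip(labels, codes.tolist()))
print(mapping)
```

Under that assumption the worst label ends up with a code between two better ones, so a model that treats the codes as a continuous quantity is fitting an arbitrary ordering.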

I would suggest trying Logistic Regression or another type of classification method, such as Naive Bayes, SVM or a decision tree classifier, instead.
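A minimal sketch of what that could look like, using a decision tree and `accuracy_score` instead of `r2_score`. To keep it runnable without `car.data.txt`, it builds a small made-up table with three of the column names from the question; the category levels and the labelling rule (`toy_label`) are hypothetical and only stand in for the real data.

```python
import itertools

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the real file: every combination of three
# categorical features, repeated five times.
levels = ['low', 'med', 'high']
rows = list(itertools.product(levels, repeat=3)) * 5
df = pd.DataFrame(rows, columns=['buying', 'maint', 'safety'])

def toy_label(row):
    """Made-up rule so the example has something to learn."""
    if row['safety'] == 'low':
        return 'unacc'
    if row['buying'] == 'high' and row['maint'] == 'high':
        return 'unacc'
    return 'good' if row['safety'] == 'high' else 'acc'

y = df.apply(toy_label, axis=1)

x_train, x_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0, stratify=y
)

# One-hot encode the categorical features, then fit a classifier
model = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    DecisionTreeClassifier(random_state=0),
)
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

One-hot encoding is a design choice here, not a requirement: it avoids implying an order between category codes, which matters more for linear models than for trees.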

  • I switched to Logistic Regression, and it raised the accuracy to 65%. Is the accuracy dependent on the model or how I use it? I included the code above, is there a way to improve the model? Thanks – David Mar 12 '19 at 20:35
  • @David The accuracy depends on many things, including the model you choose, the way you use it, the data, and even how you preprocess the data before you use it. Unfortunately, there are no shortcuts to better accuracy. You should try different models with different parameters. Checking the error curve might also give you ideas about how to proceed or what the problem might be. – melowgs Mar 12 '19 at 20:53
  • @David Please do *not* update questions in such a manner, which changes completely the context and nearly invalidates the given answers! You are very welcome to open a new question if required (edited & removed the added logistic regression part). – desertnaut Mar 13 '19 at 01:12
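
The advice in the comments to try different models with different parameters can be sketched with cross-validation. This is only an illustration: it uses scikit-learn's bundled iris dataset as a stand-in for the car data, and the candidate models and parameter values are arbitrary choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): replace with your own features and labels
X, y = load_iris(return_X_y=True)

# A few classifiers and settings to compare; the values are arbitrary
candidates = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'naive Bayes': GaussianNB(),
    'SVM (C=1)': SVC(C=1.0),
    'SVM (C=10)': SVC(C=10.0),
    'decision tree (depth 3)': DecisionTreeClassifier(max_depth=3, random_state=0),
}

mean_accuracy = {}
for name, model in candidates.items():
    # 5-fold cross-validated accuracy gives a steadier estimate than one split
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    mean_accuracy[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Comparing mean scores across folds, rather than a single train/test split, makes it less likely that one lucky split decides which model looks best.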