0

I have a dataset with many columns:

There are 4 variables used for prediction:

  • season (sum, aut, win, spr)
  • express_shipment (True, False)
  • shipping_distance (in km)
  • first_time_customer (True, False)

These 4 variables are used to calculate the shipping_price under the following rule: each season has its own separate model, and that model uses the variables listed above.

My approach was to convert True to 1 and False to 0 for the 2 Boolean columns, and to convert the season into an integer representation (1, 2, 3, 4).

The problem is that my predictions are wildly inaccurate. Here is the code I am using:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Keep the four predictors plus the target (last column)
modeling = data.loc[:, ["shipping_distance", "season_int", "new_cust_int", "express_shipment", "shipping_charge"]]
x = modeling.iloc[:, :-1]   # predictors
y = modeling.iloc[:, -1:]   # target: shipping_charge
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Can anyone explain what the correct approach to this problem is, and/or how to solve it?

Pythonuser

3 Answers

0

Here you label-encode the season as "season_int" (1, 2, 3, 4) and pass it to a linear regression. That tells the model the seasons have an intrinsic order with equal spacing between them, which they do not. You could try one-hot encoding "season_int" instead:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
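
As a rough illustration of the suggestion above, here is a minimal sketch that one-hot encodes only the season column and compares the result with the integer-coded version. The data is synthetic: the column names follow the question, but the price rule, the coefficients, and the noise level are assumptions made up for the example, not the asker's real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the asker's data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "shipping_distance": rng.uniform(1, 500, n),
    "season_int": rng.integers(1, 5, n),        # 1..4, as in the question
    "new_cust_int": rng.integers(0, 2, n),
    "express_shipment": rng.integers(0, 2, n),
})

# Assumed price rule: a per-season base fee that is NOT ordered by season number.
season_fee = data["season_int"].map({1: 5.0, 2: 20.0, 3: 2.0, 4: 12.0})
data["shipping_charge"] = (
    season_fee
    + 0.05 * data["shipping_distance"]
    + 8.0 * data["express_shipment"]
    - 3.0 * data["new_cust_int"]
    + rng.normal(0, 0.5, n)
)

X = data.drop(columns="shipping_charge")
y = data["shipping_charge"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# One-hot encode only the season column; pass the other columns through unchanged.
preprocess = ColumnTransformer(
    [("season", OneHotEncoder(handle_unknown="ignore"), ["season_int"])],
    remainder="passthrough",
)
onehot_model = make_pipeline(preprocess, LinearRegression())
onehot_model.fit(X_train, y_train)

# Baseline: season left as a single integer column, as in the question.
int_model = LinearRegression().fit(X_train, y_train)

onehot_r2 = onehot_model.score(X_test, y_test)
int_r2 = int_model.score(X_test, y_test)
print(f"R^2 with one-hot season: {onehot_r2:.3f}")
print(f"R^2 with integer season: {int_r2:.3f}")
```

Note that one-hot encoding only gives each season its own intercept. If the seasons also differ in how distance or express shipment affects the price, as the "separate model per season" rule in the question suggests, it may be worth fitting one regression per season instead.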

jhihan
0

Possible answers:

  • You are feeding integer-coded categorical variables into a linear regression, which might be an issue because the model treats the codes as ordered numbers.
  • LinearRegression might not be the best model for your problem, since the relationship might not be linear. Try a non-linear model such as sklearn.ensemble.RandomForestRegressor.
  • Your dataset might not be informative enough for the problem you are trying to solve. The variables might not be the best ones to determine the price.
  • You might not have enough data to train your model.
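
To make the second point concrete, here is a minimal sketch comparing LinearRegression with RandomForestRegressor by cross-validation. The data is synthetic and the price rule (a different rate per km in each season) is an assumption invented for the example. On real data the ranking of the two models could differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the asker's data (hypothetical values, not the real dataset).
rng = np.random.default_rng(1)
n = 600
X = pd.DataFrame({
    "shipping_distance": rng.uniform(1, 500, n),
    "season_int": rng.integers(1, 5, n),
    "new_cust_int": rng.integers(0, 2, n),
    "express_shipment": rng.integers(0, 2, n),
})

# Assumed rule: each season has its own price per km, so the effect of
# distance depends on the season (an interaction a plain linear model misses).
rate_per_km = X["season_int"].map({1: 0.02, 2: 0.10, 3: 0.05, 4: 0.15})
y = (
    rate_per_km * X["shipping_distance"]
    + 8.0 * X["express_shipment"]
    - 3.0 * X["new_cust_int"]
    + rng.normal(0, 1.0, n)
)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Cross-validation is used here rather than a single train/test split so that the comparison depends less on one particular random split.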
felice
-1

It seems like you might want a time series model (do you?): https://www.statsmodels.org/stable/examples/index.html#time-series-analysis

辜乘风