I have a dataset with many columns. Four variables are used for prediction:
- season (sum, aut, win, spr)
- express_shipment (True, False)
- shipping_distance (in km)
- first_time_customer (True, False)
These four variables are used to calculate the shipping_price according to the following rule: each season has its own separate model, which uses the variables mentioned above.
My approach was to convert True to 1 and False to 0 for the two Boolean columns. I also converted the season into an integer representation (1, 2, 3, 4).
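For reference, the encoding described above can be sketched roughly as follows. The toy rows and the season-to-integer order are assumptions made for illustration only. The column names `new_cust_int` and `season_int` match the modeling code in this post.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset
data = pd.DataFrame({
    "season": ["sum", "aut", "win", "spr"],
    "express_shipment": [True, False, True, False],
    "first_time_customer": [False, True, True, False],
    "shipping_distance": [12.0, 250.5, 40.0, 90.0],
})

# Boolean columns become 1 (True) / 0 (False)
data["express_shipment"] = data["express_shipment"].astype(int)
data["new_cust_int"] = data["first_time_customer"].astype(int)

# Season becomes an integer code; this particular order is an assumption
season_codes = {"sum": 1, "aut": 2, "win": 3, "spr": 4}
data["season_int"] = data["season"].map(season_codes)
```

Note that a linear model treats `season_int` as an ordinary number, so this coding implies a single linear effect across seasons, which may not match a rule where each season has its own model.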
The problem is that my predictions are wildly inaccurate. Here is the code I am using:
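from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# `data` is the full DataFrame; keep the four encoded features plus the target
modeling = data.loc[:, ["shipping_distance", "season_int", "new_cust_int", "express_shipment", "shipping_charge"]]
x = modeling.iloc[:, :-1]   # features: every column except the last
y = modeling.iloc[:, -1:]   # target: shipping_charge, kept as a one-column DataFrame

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1)

# A single linear model fitted across all seasons
model = LinearRegression()
model.fit(X_train, y_train)
model.predict(X_test)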
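Since the rule says each season has its own model, one variation that may be worth comparing is to fit a separate `LinearRegression` per season instead of one model over all rows. The following is only a minimal sketch under stated assumptions: the data is synthetic, the per-season slopes are made up so that the example has something to recover, and the train/test split is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical), not the real dataset
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "season_int": rng.integers(1, 5, size=n),          # values 1..4
    "shipping_distance": rng.uniform(1, 500, size=n),
    "new_cust_int": rng.integers(0, 2, size=n),
    "express_shipment": rng.integers(0, 2, size=n),
})

# Made-up rule: each season has its own price per km
slopes = {1: 0.5, 2: 0.8, 3: 1.2, 4: 0.3}
data["shipping_charge"] = (
    data["season_int"].map(slopes) * data["shipping_distance"]
    + 5.0 * data["express_shipment"]
    - 2.0 * data["new_cust_int"]
)

# Season is no longer a feature; it selects which model to use
features = ["shipping_distance", "new_cust_int", "express_shipment"]

# Fit one model per season
models = {}
for season, group in data.groupby("season_int"):
    models[season] = LinearRegression().fit(group[features], group["shipping_charge"])

# Predict each row with the model belonging to its own season
predictions = pd.Series(index=data.index, dtype=float)
for season, group in data.groupby("season_int"):
    predictions.loc[group.index] = models[season].predict(group[features])
```

In this sketch the season column only routes rows to a model, so each season can have different coefficients, which a single model with an integer-coded season cannot express.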
Is anyone able to explain the correct approach to this problem and/or how to solve it?