I am analysing Citi Bike data for September 2019 (time series data) for my dissertation, with the aim of building predictive regression models. The dataset can be found here. Currently, I'm aggregating the dataset's 2.4 million rows to get the hourly demand for bikes at each station, for each day of the month. Here's what the aggregation looks like:
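A minimal sketch of this kind of aggregation, not the exact code used here. The column names `starttime` and `start station id`, the helper name `hourly_demand`, and the sample rows are all assumptions made for illustration:

```python
import pandas as pd

# Tiny made-up stand-in for the trip-level data (column names are assumptions).
trips = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2019-09-01 08:05:00",
        "2019-09-01 08:40:00",
        "2019-09-01 09:10:00",
        "2019-09-01 08:15:00",
        "2019-09-02 08:30:00",
    ]),
    "start station id": [72, 72, 72, 79, 72],
})


def hourly_demand(trips: pd.DataFrame) -> pd.DataFrame:
    """Count trips per station, per calendar day, per hour of day."""
    keyed = trips.assign(
        date=trips["starttime"].dt.date,
        hour=trips["starttime"].dt.hour,
    )
    return (
        keyed.groupby(["start station id", "date", "hour"])
        .size()                        # number of trips in each group
        .reset_index(name="demand")    # back to a flat table
    )


demand = hourly_demand(trips)
print(demand)
```

Each output row is one (station, day, hour) combination with its trip count in `demand`. Hours with no trips at a station do not appear at all in this sketch, which matters if zero-demand hours should be part of the training data.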
I split the dataset using train_test_split and applied various stock learning algorithms, mostly from scikit-learn. However, all of these models produce a very low R2 value. For example, scikit-learn's linear regression gives an R2 of 0.09019569965308272, so the model isn't recognizing a pattern in the data. Here is the code for the linear regression model:
import pandas as pd
from sklearn.linear_model import LinearRegression

def lr(X_train, X_test, y_train, y_test):
    # Create a linear regression object
    reg = LinearRegression()
    # Scale the features with RobustScaler (fitted on the training set only)
    X_train, X_test = scaleData(X_train, X_test, "robust")
    print(X_train)
    print(X_test)
    # Fit on the training set, then report R2 on the held-out test set
    reg.fit(X_train, y_train)
    print("reg.score(X, y): {} \n".format(reg.score(X_test, y_test)))
    print("reg.coef_: {} \n".format(reg.coef_))
    print("reg.intercept_: {} \n".format(reg.intercept_))
    pred = reg.predict(X_test)
    print("Pred: {} \n".format(pred))
    # Flatten the column-shaped arrays so they fit into DataFrame columns
    results = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': pred.flatten()})
    print(results)
    PlotResultsGetPerformance(results)

lr(X_train, X_test, y_train, y_test)
The scaleData method:
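from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

def scaleData(X_train, X_test, scalingType, X=None):
    # Pick the scaler by name (case-insensitive)
    stype = scalingType.lower()
    if stype == "standard":
        scaler = StandardScaler()
    elif stype == "minmax":
        scaler = MinMaxScaler()
    elif stype == "robust":
        scaler = RobustScaler()
    else:
        raise ValueError("Unknown scalingType: {}".format(scalingType))
    if X is None:
        # Fit on the training set only, then apply the same transform to both sets
        scaler = scaler.fit(X_train)
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)
        return X_train, X_test
    else:
        # Scale a single, unsplit feature matrix
        X = scaler.fit_transform(X)
        return X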
Output of running the linear regression algorithm:
reg.score(X, y): 0.09019569965308272
reg.coef_: [[-4.71123839 0.87411394 -0.1425281 1.33332683]]
reg.intercept_: [4.94247875]
Pred: [[ 4.16553018]
[10.71438879]
[ 5.21549358]
...
[10.23551752]
[ 4.94370368]
[ 4.10551935]]
Mean Absolute Error: 4.833603206597555
Mean Squared Error: 61.94697363477656
Root Mean Squared Error: 7.870639976188503
R2: 0.09019569965308272
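For reference, the four metrics above follow the standard definitions, with R2 = 1 - SS_res / SS_tot, which is also what `reg.score` returns for a regressor. The body of `PlotResultsGetPerformance` is not shown in this post, so the following is only a sketch of how such figures could be computed from a results table with `Actual` and `Predicted` columns. The function name `get_performance` and the sample numbers are hypothetical:

```python
import numpy as np
import pandas as pd


def get_performance(results: pd.DataFrame) -> dict:
    """Compute MAE, MSE, RMSE and R2 from 'Actual' and 'Predicted' columns."""
    actual = results["Actual"].to_numpy(dtype=float)
    predicted = results["Predicted"].to_numpy(dtype=float)
    residuals = actual - predicted

    mae = float(np.mean(np.abs(residuals)))
    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))

    # R2 compares the model against always predicting the mean of the actuals.
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((actual - actual.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    return {
        "Mean Absolute Error": mae,
        "Mean Squared Error": mse,
        "Root Mean Squared Error": rmse,
        "R2": r2,
    }


# Made-up numbers, only to exercise the function.
results = pd.DataFrame({
    "Actual": [3.0, 5.0, 7.0, 9.0],
    "Predicted": [4.0, 5.0, 6.0, 10.0],
})
metrics = get_performance(results)
for name, value in metrics.items():
    print("{}: {}".format(name, value))
```

Read this way, an R2 of about 0.09 means the model's squared error on the test set is only about 9% lower than that of a constant prediction equal to the mean test-set demand.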
What seems to be the problem? Is it a data issue or a model issue? Any help would be appreciated.