
I am analysing Citi Bike data for September 2019 (time-series data) for my dissertation, to build predictive regression models. The dataset can be found here. Currently, I'm aggregating the 2.4-million-row dataset to get the hourly demand of bikes at each station on each day of the month. Here's what the aggregation looks like:
[Image: Citi Bike aggregated data]
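For reference, the hourly aggregation itself can be sketched with pandas, assuming the raw export has columns named "starttime" and "start station id" (the actual column names in your file may differ):

```python
import pandas as pd

# Tiny synthetic sample standing in for the raw trip data
trips = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2019-09-01 08:05", "2019-09-01 08:40",
        "2019-09-01 09:10", "2019-09-02 08:20",
    ]),
    "start station id": [72, 72, 72, 79],
})

# Count trips per station per hour: floor each start time to the hour,
# group by station and hour, and take the group sizes as demand
demand = (
    trips
    .groupby(["start station id", trips["starttime"].dt.floor("H")])
    .size()
    .reset_index(name="demand")
)
print(demand)
```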

I split the dataset using train_test_split and applied various stock learning algorithms, mostly from scikit-learn. However, these models all output a very low R² value. For example, scikit-learn's linear regression gives an R² of 0.09019569965308272, so the model isn't recognising a pattern in the data. Here is the code for the linear regression model:

def lr(X_train, X_test, y_train, y_test):
    # Create a linear regression object
    reg = LinearRegression()

    # Scale the features (fit on the training set only)
    X_train, X_test = scaleData(X_train, X_test, "robust")

    print(X_train)
    print(X_test)

    reg.fit(X_train, y_train)
    print("reg.score(X, y): {} \n".format(reg.score(X_test, y_test)))
    print("reg.coef_: {} \n".format(reg.coef_))
    print("reg.intercept_: {} \n".format(reg.intercept_))

    pred = reg.predict(X_test)
    print("Pred: {} \n".format(pred))

    results = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': pred.flatten()})
    print(results)

    PlotResultsGetPerformance(results)

lr(X_train, X_test, y_train, y_test)

The scaleData method:

def scaleData(X_train, X_test, scalingType, X=None):
    stype = scalingType.lower()
    if stype == "standard":
        scaler = StandardScaler()
    elif stype == "minmax":
        scaler = MinMaxScaler()
    elif stype == "robust":
        scaler = RobustScaler()
    else:
        raise ValueError("Unknown scaling type: {}".format(scalingType))

    if X is None:
        # Fit on the training set only, then transform both sets
        scaler.fit(X_train)
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)
        return X_train, X_test
    else:
        return scaler.fit_transform(X)

Output of running linear regression algorithm:

reg.score(X, y): 0.09019569965308272 
reg.coef_: [[-4.71123839  0.87411394 -0.1425281   1.33332683]] 
reg.intercept_: [4.94247875] 

Pred: [[ 4.16553018]
 [10.71438879]
 [ 5.21549358]
 ...
 [10.23551752]
 [ 4.94370368]
 [ 4.10551935]] 


Mean Absolute Error: 4.833603206597555
Mean Squared Error: 61.94697363477656
Root Mean Squared Error: 7.870639976188503
R2: 0.09019569965308272

What seems to be the problem? Is it a data issue or a model issue? Any help would be appreciated.

Julian P

1 Answer


In this case, I think your model is suffering from under-fitting (high bias). It looks like there is not enough information in the explanatory variables: you need more columns in your data. You can engineer new columns, or find additional explanatory variables and add them to the model. Also, check the quality of the variables you already have; some transformations may help improve the predictions. Here is a more detailed explanation: https://towardsdatascience.com/what-are-overfitting-and-underfitting-in-machine-learning-a96b30864690
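As one concrete (hypothetical) example of the feature engineering this suggests: hour-of-day is cyclical, so encoding it with sine/cosine and adding a weekend flag often helps a linear model pick up daily and weekly demand patterns. The column names below are assumptions, not your actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical feature table with raw time columns
df = pd.DataFrame({"hour": [0, 6, 12, 18], "dayofweek": [0, 5, 6, 2]})

# Cyclical encoding: hour 23 and hour 0 become close in feature space
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Weekend flag (pandas dayofweek: Saturday=5, Sunday=6)
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)
print(df)
```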

programandoconro