0

I'm trying to do simple linear regression using this small Dataset (Screenshot).

The dataset is records divided into small time blocks of 4 years each (Except for the 2nd to the last time block of 2016-2018).

What I'm trying to do is try to predict the output of records for the timeblock of 2019-2022. To do this, I placed a 2019-2022 time block with all its rows containing the value of 0 (Since there's nothing made during that time since it's the future). I did that to accommodate the syntax of sklearn's train_test_split and went with this code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

df = pd.read_csv("TCO.csv")
df = df[['2000-2003', '2004-2007', '2008-2011','2012-2015','2016-2018','2019-2022']]
linreg = LinearRegression()
X1_train, X1_test, y1_train, y1_test = train_test_split(df[['2000-2003','2004-2007','2008-2011',
'2012-2015','2016-2018']],df['2019-2022'],test_size=0.4,random_state = 42)

linreg.fit(X1_train, y1_train)
linreg.intercept_
list( zip( ['2000-2003','2004-2007','2008-2011','2012-2015','2016-2018'],list(linreg.coef_)))

y1_pred = linreg.predict(X1_test)
print(y1_pred)

test_pred_df = pd.DataFrame({'actual': y1_test,
                          'predicted': np.round(y1_pred, 2),
                          'residuals': y1_test - y1_pred})

print(test_pred_df[0:10].to_string())

For some reason, the algorithm would always return a 0 as the final prediction for all rows with 0 residuals (This is due to the timeblock of 2019-2022 having all rows of zero.)

I think I did something wrong but I can't tell what it is. (I'm a beginner in this topic.) Can someone point out what went wrong and how to fix it?

Edit: I added a copy-able version of the data:

      df = pd.DataFrame( {'Country:':['Brunei','Cambodia','Indonesia','Laos',
                             'Malaysia','Myanmar','Philippines','Singaore',
                             'Thailand','Vietnam'],
                 '2000-2003': [0,0,14,1,6,0,25,8,26,8],
                 '2004-2007': [0,3,15,6,21,0,37,11,44,36],
                 '2008-2011': [0,5,31,9,75,0,58,27,96,61],
                 '2012-2015': [5,11,129,35,238,3,99,65,170,96],
                 '2016-2018': [6,22,136,17,211,10,66,89,119,88]})
  • Edit: Re-uploaded Dataset Screenshot because it caused misunderstandings. – Ace Ellasos Jun 28 '18 at 06:17
  • Can you provide the data in a copyable format? Read [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for instance. – Marcus V. Jun 28 '18 at 06:40
  • Anyhow, your problem seems to be a time-series forecasting problem now. So you may actually consider each series by itself and add lagged versions of the same series as feature (using `shift()`). But you may also consider completely different methdods from time series forecasting – Marcus V. Jun 28 '18 at 06:51
  • @MarcusV. I added a copyable version of the data. How do is use shift( ) in this case? How would it relate to linear regression? – Ace Ellasos Jun 28 '18 at 07:50
  • See my answer, I think this is what you asked for. Using lagged variables may only make sense if you have longer periods, or add other features. – Marcus V. Jun 28 '18 at 09:51

1 Answers1

0

Based on your data, I think this is what you ask for [Edit: see updated version below]:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame( {'Country:':['Brunei','Cambodia','Indonesia','Laos',
                             'Malaysia','Myanmar','Philippines','Singaore',
                             'Thailand','Vietnam'],
                 '2000-2003': [0,0,14,1,6,0,25,8,26,8],
                 '2004-2007': [0,3,15,6,21,0,37,11,44,36],
                 '2008-2011': [0,5,31,9,75,0,58,27,96,61],
                 '2012-2015': [5,11,129,35,238,3,99,65,170,96],
                 '2016-2018': [6,22,136,17,211,10,66,89,119,88]})

# create a transposed version with country in header
df_T = df.T
df_T.columns = df_T.iloc[-1]
df_T = df_T.drop("Country:")

# create a new columns for target
df["2019-2022"] = np.NaN

# now fit a model per country and add the prediction
for country in df_T:
    y = df_T[country].values
    X = np.arange(0,len(y))
    m = LinearRegression()
    m.fit(X.reshape(-1, 1), y)
    df.loc[df["Country:"] == country, "2019-2022"] = m.predict(5)[0]

This prints:

Country:        2000-2003   2004-2007   2008-2011   2012-2015   2016-2018   2019-2022
Brunei  0   0   0   5   6   7.3
Cambodia    0   3   5   11  22  23.8
Indonesia   14  15  31  129 136 172.4
Laos    1   6   9   35  17  31.9
Malaysia    6   21  75  238 211 298.3
Myanmar 0   0   0   3   10  9.5
Philippines 25  37  58  99  66  100.2
Singaore    8   11  27  65  89  104.8
Thailand    26  44  96  170 119 184.6
Vietnam 8   36  61  96  88  123.8

Forget about my comment with shift(). I thought about it, but it makes not sense for this small amount of data, I think. But considering time series methods and treating each country's series as a time series may still be worth for you.

Edit:

Excuse me. The above code is unnessary complicated, but was just result of me going through it step by step. Of course it can simply be done row by row like tihs:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame( {'Country:':['Brunei','Cambodia','Indonesia','Laos',
                             'Malaysia','Myanmar','Philippines','Singaore',
                             'Thailand','Vietnam'],
                 '2000-2003': [0,0,14,1,6,0,25,8,26,8],
                 '2004-2007': [0,3,15,6,21,0,37,11,44,36],
                 '2008-2011': [0,5,31,9,75,0,58,27,96,61],
                 '2012-2015': [5,11,129,35,238,3,99,65,170,96],
                 '2016-2018': [6,22,136,17,211,10,66,89,119,88]})

# create a new columns for target
df["2019-2022"] = np.NaN

for idx, row in df.iterrows():
    y = row.drop(["Country:", "2019-2022"]).values
    X = np.arange(0,len(y))
    m = LinearRegression()
    m.fit(X.reshape(-1, 1), y)
    df.loc[idx, "2019-2022"] = m.predict(len(y)+1)[0]

1500 rows should be no problem.

Marcus V.
  • 6,323
  • 1
  • 18
  • 33
  • I love how detailed this is, is there any way I can run this using pandas? Because I have to run it again using a csv file with a 1500+ rows of the same formated data as the screenshot. – Ace Ellasos Jun 28 '18 at 09:58
  • see edit. Sorry, that was unnecessary complicated :) – Marcus V. Jun 28 '18 at 10:32
  • When I try to change the values (for another dataset, same size and structure), I get the error: "ValueError: Must have equal len keys and value when setting with an iterable". I can't tell what's causing this, is there something causing this? – Ace Ellasos Jun 28 '18 at 10:44
  • It was the hard-coded 5 in the last line. Edit should do. – Marcus V. Jun 28 '18 at 10:47