Simple linear regression using pandas dataframe

Question

I'm looking to check trends for a number of entities (SysNr)

I have data spanning 3 years (2014,2015,2016)

I'm looking at a large quantity of variables, but will limit this question to one ('res_f_r')

My DataFrame looks something like this

d = [
    {'RegnskabsAar': 2014, 'SysNr': 1, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 1, 'res_f_r': 400000},
    {'RegnskabsAar': 2016, 'SysNr': 1, 'res_f_r': 450000},
    {'RegnskabsAar': 2014, 'SysNr': 2, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 2, 'res_f_r': 300000},
    {'RegnskabsAar': 2016, 'SysNr': 2, 'res_f_r': 250000},
]

df = pd.DataFrame(d)



   RegnskabsAar  SysNr  res_f_r
0          2014      1   350000
1          2015      1   400000
2          2016      1   450000
3          2014      2   350000
4          2015      2   300000
5          2016      2   250000

My desire is to do a linear regression on each entity (SysNr) and get returned the slope and intercept

My desired output for the above is

   SysNr  intercept  slope
0      1     300000  50000
1      2     400000 -50000

Any ideas?

Nico Albers · Accepted Answer · 2018-02-14T12:05:27.757

So I don't know why our intercept values differ (maybe I have made a mistake or your given data is not the full data you expect to work on), but I'd suggest you to use np.polyfit or the tool of your choice (scikit-learn, scipy.stats.linregress, ...) in combination with groupby and apply:

In [25]: df.groupby("SysNr").apply(lambda g: np.polyfit(g.RegnskabsAar, g.res_f_r, 1))
Out[25]:
SysNr
1    [49999.99999999048, -100349999.99998075]
2    [-49999.99999999045, 101049999.99998072]
dtype: object

After that, beautify it:

In [43]: df.groupby("SysNr").apply(
    ...:     lambda g: np.polyfit(g.RegnskabsAar, g.res_f_r, 1)).apply(
    ...:     pd.Series).rename(columns={0:'slope', 1:'intercept'}).reset_index()
Out[43]:
   SysNr    slope     intercept
0      1  50000.0 -1.003500e+08
1      2 -50000.0  1.010500e+08

Edit:

Because you asked on the other answer in the comment about how to handle missing years for some SysNr: Just drop that NaNs for a valid linear regression. Of course you could also fill them with the mean or so, depending on what you want to achieve, but that isn't that helpful from my point of view.

If the entity has only data for one year, you can't apply a linear regression on that usefully. But you can (if you want and that fits your case, please provide more information on the data if needed) extrapolate somehow the slope of the other entities to this one and calculate the intercept. For that of course you must make some assumptions on the distribution of the slope of the entities (e.g. linear, then the slope of sysNr 3 would be -150000.0).

I'm getting the same intercept values you found (different from OP) using sklearn's [linear_model.LinearRegression()](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). — 3novak, Feb 13 '18 at 14:00
The intercept values I found came from Excel trend line, this apparently used 2013 as a base where your figures go to 0. Either is fine, since I'm mostly looking for the slope — Henrik Poulsen, Feb 14 '18 at 07:16

score 3 · Answer 2 · answered Feb 13 '18 at 14:32

You can also use linregress from scipy.stats with groupby from the pandas:

from scipy.stats import linregress

# groupby column
grouped = df.groupby('SysNr')

# https://stackoverflow.com/a/14775604/5916727
# apply linear regression to each group
result_df = pd.DataFrame(grouped.apply(lambda x: linregress(x['RegnskabsAar'], x['res_f_r']))).reset_index()

# https://stackoverflow.com/a/29550458/5916727
# expand result to each column
result_df[['slope', 'intercept', 'r_value', 'p_value', 'std_err']] = result_df[0].apply(pd.Series)

# drop initial column with all in one
del result_df[0]

result_df

Result:

   SysNr    slope    intercept  r_value       p_value  std_err
0      1  50000.0 -100350000.0      1.0  9.003163e-11      0.0
1      2 -50000.0  101050000.0     -1.0  9.003163e-11      0.0

How would I go about dealing with NaN values in this setup?. Some entities only have data in one or two years, leaving the rest NaN. Should I remove these on the SQL side or can we handle this within the linregress function? — Henrik Poulsen, Feb 14 '18 at 08:28
You can try using mask as suggested in https://stackoverflow.com/a/13643460/5916727 — niraj, Feb 14 '18 at 13:57

Simple linear regression using pandas dataframe

2 Answers2

Edit:

Linked