0

I've been looking for the most current method to create a linear regression model given a Pandas Dataframe.

DF looks like:

+---------------------+-------------+--------------------+--------------------+
|        Date         | YearWeekNum | Dependent_Variable | Bonus_Grouping_Int |
+---------------------+-------------+--------------------+--------------------+
| 2017-07-01 00:12:07 | 2017-Wk26   |               35.4 |                  1 |
| 2017-07-01 00:12:07 | 2017-Wk26   |               33.3 |                  2 |
| 2018-01-05 25:12:07 | 2018-Wk0    |               28.2 |                  1 |
| 2018-01-05 25:12:07 | 2018-Wk0    |               24.2 |                  2 |
+---------------------+-------------+--------------------+--------------------+

I've created the YearWeekNum column with:

df['YearWeekNum'] = df['Date'].dt.strftime('%Y-Wk%U')

I'd love to be able to create a linear regression that uses the YearWeekNum as the independent (predictor) variable and the Dependent Variable as (you guessed it) the dependent (response) variable. In the end, a plot that looks like this:

Desired plot

I tried this question, by using result = sm.ols(formula="Dependent_Variable ~ YearWeekNum", data=df).fit(), but it creates a model with each YearWeekNum as its own independent variable (doing a regression for each week period.

From this one, I also tried:

from pandas.stats.api import ols

but got:

ImportError: cannot import name 'ols'

It seems like the ols has been deprecated. So, my question is: How can I run a linear regression on a dataframe by Year and Week Number as the independent variable using Pandas?

Cherry on top: would be creating two regression models based on the grouping int (red line is values with Grouping int 1 and indigo line is values with Grouping int 2)

Thanks in advance!

HSchmachty
  • 307
  • 1
  • 14

1 Answers1

0

This is the "solution" that I was able to make work:

First, I only wanted weeks 1-52, not to include 0 or 53.

df['YearWeekNum'] = df['Date'].dt.strftime('%Y-Wk%U')
df.loc[df['YearWeekNum'].str.contains('Wk53') == True, 'YearWeekNum'] = '2017-Wk52'
df.loc[df['YearWeekNum'].str.contains('Wk00') == True, 'YearWeekNum'] = '2018-Wk01'

Then, I created a column that would group all the dates sequentially by the week of the year using the dt.to_period functionality:

df['time_period'] = df['instrumentstartedon'].dt.to_period(freq='W')

This is where it gets a little roundabout. First, create an ordered set of the time periods by week:

dictionary_of_time_periods = dict()
set_of_periods = set(df['time_period'])
ordered_list_of_set = list(set_of_periods)
ordered_list_of_set.sort()

Second, create a dictionary where the chronologically ordered time periods are given sequential numbers:

index_key = 0

# The following loop creates a dictionary of each time period (weeks by default)
# which is used to create a consecutive sequence (1,n) for each week.
# This dictionary is passed into the "apply_order" function which adds the column
# to the DataFrame
for t_period_pair in ordered_list_of_set:
    this_per = ordered_list_of_set[index_key]
    dictionary_of_time_periods[this_per] = (index_key + 1)
    index_key += 1

Last, add a new column to the dataframe where each data point is given the number (0,n) from the ordered dictionary:

df['ordered_nums'] = df.apply(lambda to_column: apply_order(to_column['time_period'], dictionary_of_time_periods),
                              axis=1)

Where the function apply_order is simply a dictionary lookup:

def apply_order(df_like, dictionary_of_timeframes):
    return dictionary_of_timeframes[df_like]

Then, for the linear regression:

import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

plt.style.use('ggplot')
regression_result = smf.ols(formula='Dependent_Variable ~ ordered_nums', data=df).fit()
print(regression_result.summary())
print(regression_result.params)

regression_intercept = regression_result.params[0]
regression_slope = regression_result.params[1]

n_points = len(set(df['YearWeekNum']))
plot_x_array = []
for inty in range(0, (n_points + 2)):
    plot_x_array += [inty]

ols_regression_y_hat = [regression_slope * i + regression_intercept for i in plot_x_array]
ax.plot(plot_x_array, ols_regression_y_hat, c='xkcd:violet', label='Linear Regression')
fig.legend()

I hope this helps someone sometime!

HSchmachty
  • 307
  • 1
  • 14