how to run regression on groups with dates

Question

I am trying to calculate the regression coefficient of weight for every animal_id and cycle_nr in my df:

animal_id	cycle_nr	feed_date	weight
1003	8	2020-02-06	221
1003	8	2020-02-10	226
1003	8	2020-02-14	230
1004	1	2020-02-20	231
1004	1	2020-02-21	243

What I tried, based on this source:

import pandas as pd
import statsmodels.api as sm 


def GroupRegress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'feed_date', ['weight'])

This code fails because my variable includes a date.

What I tried next:

I figured I could create a numeric column to use instead of my date column, so I added a simple running-count column, id:

animal_id	cycle_nr	feed_date	weight	id
1003	8	2020-02-06	221	1
1003	8	2020-02-10	226	2
1003	8	2020-02-14	230	3
1004	1	2020-02-20	231	4
1004	1	2020-02-21	243	5

Then I ran my regression on this column

result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'id', ['weight'])

The slope calculation looks good, but the intercept of course makes no sense.

Then I realized that this method is only usable when the interval between measurements is regular. In most cases the interval is 7 days, but sometimes it is 10, 14 or 21 days.

I dropped records where the interval was not 7 days and re-ran my regression. It works, but I hate that I have to throw away perfectly fine data.

I'm wondering if there is a better approach where I can either include the date in my regression or can correct for the varying intervals of my dates. Any suggestions?

Related, possible duplicate: [Can I plot a linear regression with datetimes on the x-axis with seaborn?](https://stackoverflow.com/questions/29308729/can-i-plot-a-linear-regression-with-datetimes-on-the-x-axis-with-seaborn), ... [Regression with Date variable using Scikit-learn](https://stackoverflow.com/questions/16453644/regression-with-date-variable-using-scikit-learn), ... [linear regression for timeseries python (numpy or pandas)](https://stackoverflow.com/questions/32327471/linear-regression-for-timeseries-python-numpy-or-pandas) — wwii, Jul 16 '21 at 14:27

wwii · Answer 1 · 2021-07-16T14:55:58.540

> I'm wondering if there is a better approach where I can either include the date in my regression or can correct for the varying intervals of my dates.

If the feed dates are strings, make a datetime Series using pandas.to_datetime.
Use that new Series to calculate the actual time difference between feedings.
Use the resulting Timedeltas in your regression instead of a fabricated linear sequence. Timedeltas have several attributes (e.g. microseconds, days) that can be used depending on the resolution you need.

My first instinct would be to produce the Timedeltas for each group separately. The first feeding in each group would of course be time zero.
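The steps above could be sketched roughly as follows, using the sample rows from the question. This is only a minimal sketch: the `days` column and the `fit_line` helper are names made up for illustration, and `numpy.polyfit` is swapped in for the question's `sm.OLS` call to keep the example short (both fit an ordinary least-squares line).

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "animal_id": [1003, 1003, 1003, 1004, 1004],
        "cycle_nr": [8, 8, 8, 1, 1],
        "feed_date": ["2020-02-06", "2020-02-10", "2020-02-14", "2020-02-20", "2020-02-21"],
        "weight": [221, 226, 230, 231, 243],
    }
)
df["feed_date"] = pd.to_datetime(df["feed_date"])

group_keys = ["animal_id", "cycle_nr"]

# Days elapsed since the first feeding within each group (first feeding = day 0).
first_feeding = df.groupby(group_keys)["feed_date"].transform("min")
df["days"] = (df["feed_date"] - first_feeding).dt.days


def fit_line(group):
    # Hypothetical helper: least-squares fit of weight = slope * days + intercept.
    slope, intercept = np.polyfit(group["days"], group["weight"], deg=1)
    return pd.Series({"slope": slope, "intercept": intercept})


result = df.groupby(group_keys)[["days", "weight"]].apply(fit_line)
print(result)
```

Because `days` measures real elapsed time, irregular intervals (4, 7, 10 days and so on) are handled without dropping rows, and the intercept can be read as the fitted weight at the first feeding of each group. With the sample data, the slope for animal 1003 comes out at 1.125 weight units per day.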

Making the Timedeltas may not even be necessary - there are probably datetime-aware regression methods in NumPy or SciPy or maybe even pandas - I imagine there would have to be, it is a common enough application.

Instead of Timedeltas, the datetime Series could be converted to ordinal values for use in the regression.

df = pd.DataFrame(
    {
        "feed_date": [
            "2020-02-06",
            "2020-02-10",
            "2020-02-14",
            "2020-02-20",
            "2020-02-21",
        ]
    }
)


>>> q = pd.to_datetime(df.feed_date)
>>> q
0   2020-02-06
1   2020-02-10
2   2020-02-14
3   2020-02-20
4   2020-02-21
Name: feed_date, dtype: datetime64[ns]
>>> q.apply(pd.Timestamp.toordinal)
0    737461
1    737465
2    737469
3    737475
4    737476
Name: feed_date, dtype: int64
>>>
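As a rough sketch of how the ordinal values might feed a per-group regression, the example below fits a line to each group with the closed-form least-squares formulas (slope = cov(x, y) / var(x)). The `ordinal` column and the `ols_line` helper are hypothetical names, and the closed form is used here in place of `sm.OLS` only to keep the example dependency-free.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "animal_id": [1003, 1003, 1003, 1004, 1004],
        "cycle_nr": [8, 8, 8, 1, 1],
        "feed_date": ["2020-02-06", "2020-02-10", "2020-02-14", "2020-02-20", "2020-02-21"],
        "weight": [221, 226, 230, 231, 243],
    }
)

# Ordinal day number for each feeding date.
df["ordinal"] = pd.to_datetime(df["feed_date"]).apply(pd.Timestamp.toordinal)


def ols_line(group):
    # Hypothetical helper: simple linear regression of weight on ordinal day.
    x, y = group["ordinal"], group["weight"]
    slope = x.cov(y) / x.var()
    intercept = y.mean() - slope * x.mean()
    return pd.Series({"slope": slope, "intercept": intercept})


result = df.groupby(["animal_id", "cycle_nr"])[["ordinal", "weight"]].apply(ols_line)
print(result)
```

The slope is in weight units per day and is unaffected by uneven gaps between feedings. The intercept, however, refers to ordinal day 0, far outside the data, so it is not meaningful on its own; measuring days from each group's first feeding gives an intercept that is easier to interpret.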

That is a good point. Converting my dates to ordinal values is a good solution, and I will give that a go. I will also have a look at the Timedelta solution. I like my data to remain readable, so this would be a nice solution. Thanks for your feedback. — brenda89, Jul 16 '21 at 17:36
