I am trying to calculate the regression coefficient of weight for every animal_id and cycle_nr in my df:
animal_id | cycle_nr | feed_date | weight |
---|---|---|---|
1003 | 8 | 2020-02-06 | 221 |
1003 | 8 | 2020-02-10 | 226 |
1003 | 8 | 2020-02-14 | 230 |
1004 | 1 | 2020-02-20 | 231 |
1004 | 1 | 2020-02-21 | 243 |
What I tried, based on this source:
import pandas as pd
import statsmodels.api as sm

def GroupRegress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'feed_date', ['weight'])
This code fails because feed_date is a date rather than a numeric column.
What I tried next:
I figured I could create a numeric column to use instead of my date column. I created a simple running-count column, id:
animal_id | cycle_nr | feed_date | weight | id |
---|---|---|---|---|
1003 | 8 | 2020-02-06 | 221 | 1 |
1003 | 8 | 2020-02-10 | 226 | 2 |
1003 | 8 | 2020-02-14 | 230 | 3 |
1004 | 1 | 2020-02-20 | 231 | 4 |
1004 | 1 | 2020-02-21 | 243 | 5 |
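A column like this can be built along the following lines. This is only a minimal sketch, not necessarily the exact code used: the sample rows are copied from the table above, and it assumes the frame is already sorted by animal_id, cycle_nr and feed_date.

```python
import pandas as pd

# Sample rows copied from the table above
df = pd.DataFrame({
    "animal_id": [1003, 1003, 1003, 1004, 1004],
    "cycle_nr": [8, 8, 8, 1, 1],
    "feed_date": pd.to_datetime(
        ["2020-02-06", "2020-02-10", "2020-02-14", "2020-02-20", "2020-02-21"]
    ),
    "weight": [221, 226, 230, 231, 243],
})

# Running count over the whole frame, starting at 1.
# It ignores how many days lie between consecutive measurements.
df["id"] = range(1, len(df) + 1)
```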
Then I ran my regression on this column:
result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'id', ['weight'])
The slope calculation looks good, but the intercept of course makes no sense.
Then I realized that this method is only usable when the interval between measurements is regular. In most cases the interval is 7 days, but sometimes it is 10, 14 or 21 days.
I dropped records where the interval was not 7 days and re-ran my regression. It works, but I hate that I have to throw away perfectly fine data.
I'm wondering if there is a better approach where I can either include the date in my regression or can correct for the varying intervals of my dates. Any suggestions?
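One direction that might handle the irregular intervals is to convert feed_date into days elapsed since the first measurement of each group and regress weight on that number. The sketch below is an assumption about what could work, not a tested solution for the full data set: it uses `np.polyfit` in place of `sm.OLS` to stay short, `fit_group` is a hypothetical helper name, and the sample rows are copied from the table at the top.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the table at the top of the question
df = pd.DataFrame({
    "animal_id": [1003, 1003, 1003, 1004, 1004],
    "cycle_nr": [8, 8, 8, 1, 1],
    "feed_date": pd.to_datetime(
        ["2020-02-06", "2020-02-10", "2020-02-14", "2020-02-20", "2020-02-21"]
    ),
    "weight": [221, 226, 230, 231, 243],
})

def fit_group(group):
    # Days elapsed since the first measurement in this group,
    # so uneven gaps (4, 7, 10, ... days) are reflected in x.
    days = (group["feed_date"] - group["feed_date"].min()).dt.days
    # Degree-1 least-squares fit: weight = slope * days + intercept
    slope, intercept = np.polyfit(days, group["weight"], 1)
    return pd.Series({"slope_per_day": slope, "intercept": intercept})

# Select only the columns the helper needs before applying it per group
result = (
    df.groupby(["animal_id", "cycle_nr"])[["feed_date", "weight"]]
      .apply(fit_group)
)
print(result)
```

With this encoding the slope is in weight units per day and the intercept is the fitted weight on the first measurement day of each group. Groups with a single measurement cannot be fitted meaningfully and would need to be filtered out first. The same elapsed-days column could presumably also be passed to GroupRegress as the x variable, with weight as the y variable.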