0

Context:

I have a data frame similar to this, except that it extends over decades of data:

df = pd.DataFrame({'time':['2003-02-02', '2003-02-03', '2003-02-04', '2003-02-05', '2003-02-06', '2003-02-07', '2003-02-08', '2003-02-09','2003-02-10', '2003-02-11'], 'NDVI': [0.505413, 0.504566, 0.503682, 0.502759, 0.501796, 0.500791, 0.499743, 0.498651, 0.497514, 0.496332]})
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d')
df.set_index('time', inplace=True)
Output:
                NDVI
time                
2003-02-02  0.505413
2003-02-03  0.504566
2003-02-04  0.503682
2003-02-05  0.502759
2003-02-06  0.501796
2003-02-07  0.500791
2003-02-08  0.499743
2003-02-09  0.498651
2003-02-10  0.497514
2003-02-11  0.496332

Problem:

I would like to:

  1. Get the mean NDVI value at a custom time interval that starts from the beginning of every year. If the interval is e.g. 10 days, values will be binned as [Jan-1 : Jan-10], [Jan-11 : Jan-20] etc. The last interval of the year will have to be either a 5- or 6-day interval depending on being a leap year (i.e. 360th-365/6th day of the year).
  2. Add a column for the corresponding interval number, so the output would be something similar to this:
                NDVI  yr_interval
time                
2003-01-31  0.505413            4
2003-02-10  0.497514            5

In the above example, the first line represents the 4th 10-day interval of year 2003.

Question:

How to implement that, knowing that:

  • For time series spanning several years, the interval number should restart at every year (a similar behaviour to pandas.Series.dt.week)?
  • That the code should be flexible enough to test other time intervals (e.g. 8 days)?
e5k
  • 194
  • 1
  • 14
  • What have you tried so far? – cwalvoort Mar 17 '20 at 12:15
  • I honestly don’t know where to start. – e5k Mar 17 '20 at 12:49
  • Try starting here, then post any code you need help with: https://stackoverflow.com/a/30328738/3923163 – cwalvoort Mar 17 '20 at 13:09
  • Thanks. I can group or resample, but I don't see how that helps achieving what I need, which is specifically i) creating time intervals that start from the beginning of the year and restart each year and ii) ranging the intervals throughout the year. The closest analogy I can find is `pandas.Series.dt.week` except for a custom number of days instead of 7. – e5k Mar 17 '20 at 13:20

2 Answers2

2

How about trying pandas.Series.dt.dayofyear and divide that result by the interval you want? This would be equivalent to pandas.Series.dt.week if you used 7 as your interval.

The proof is left as an exercise for the reader.

cwalvoort
  • 1,851
  • 1
  • 18
  • 19
  • 1
    This is a good one!!! `df['yr_interval'] = df.index.dayofyear // 10` - Should work great. +1 – davidbilla Mar 17 '20 at 18:53
  • 1
    Thanks for this valuable lesson ^_^. It is so easy to get stuck in your bubble of functions and loose the perspective of simple logics! Thanks both, ended up using a `ceil` rather than `floor` but it works like charm! – e5k Mar 18 '20 at 01:09
  • @e5k very true :) – davidbilla Mar 18 '20 at 13:29
1

This should work. I don't know if there is any other efficient way to do this, but here is what it looks like. Here I have grouped the the dates by year and then group each year group by the time interval of 10 days and get the index of each 10 day group within the year group - so that the index resets for every year group

Sample data frame - 'time' col has date values from 01/01/2005 to 12/31/2007 and 'NDVI' col has some random values.

np.random.seed(123)
df = pd.DataFrame({'time': pd.date_range('2005-01-01', '2007-12-31', freq='1D'),
                   'NDVI': np.random.randn(1095)})
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d')
df.set_index('time', inplace=True)

The function assign_group_interval takes each group (as df) and groups them by the time interval of 10 days and gets the yr_interval value from the index of the grouped df. This function will be applied to each year group so that the identifier resets for each year group.

def assign_group_interval(year_grp):
    year_grp['period'] = year_grp.groupby(pd.Grouper(freq='10D'), as_index=False)\
        .apply(lambda x: x['NDVI'])\
        .index.get_level_values(0)
    return year_grp

Group the df by year and apply the function assign_group_interval

print(df.groupby(df.index.year).apply(assign_group_interval).reset_index())

Output:

           time      NDVI  period
0    2005-01-01 -1.085631       0
1    2005-01-02  0.997345       0
2    2005-01-03  0.282978       0
3    2005-01-04 -1.506295       0
4    2005-01-05 -0.578600       0
5    2005-01-06  1.651437       0
6    2005-01-07 -2.426679       0
7    2005-01-08 -0.428913       0
8    2005-01-09  1.265936       0
9    2005-01-10 -0.866740       0
10   2005-01-11 -0.678886       1
11   2005-01-12 -0.094709       1
12   2005-01-13  1.491390       1
13   2005-01-14 -0.638902       1
14   2005-01-15 -0.443982       1
15   2005-01-16 -0.434351       1
16   2005-01-17  2.205930       1
17   2005-01-18  2.186786       1
18   2005-01-19  1.004054       1
19   2005-01-20  0.386186       1
20   2005-01-21  0.737369       2
21   2005-01-22  1.490732       2
22   2005-01-23 -0.935834       2
23   2005-01-24  1.175829       2
24   2005-01-25 -1.253881       2
25   2005-01-26 -0.637752       2
...
...
455  2006-04-01  0.104061       9
456  2006-04-02  0.165957       9
457  2006-04-03  1.601908       9
458  2006-04-04  0.058687       9
459  2006-04-05  1.064423       9
460  2006-04-06 -0.039329       9
461  2006-04-07  1.448904       9
462  2006-04-08 -1.870397       9
463  2006-04-09 -0.598732       9
464  2006-04-10  0.983033       9
465  2006-04-11 -0.171596      10
466  2006-04-12  0.931530      10
467  2006-04-13  0.385066      10
468  2006-04-14  0.945877      10
469  2006-04-15  0.613068      10
470  2006-04-16  0.673649      10
471  2006-04-17  1.492455      10
472  2006-04-18  0.986474      10
473  2006-04-19  0.993807      10
474  2006-04-20  0.020419      10
475  2006-04-21 -0.581850      11
476  2006-04-22 -0.659560      11
477  2006-04-23  0.750945      11
478  2006-04-24 -2.438461      11
479  2006-04-25 -1.307178      11
480  2006-04-26 -0.963254      11
...
...
830  2007-04-11 -0.482365      10
831  2007-04-12  1.079796      10
832  2007-04-13 -0.421079      10
833  2007-04-14 -1.166471      10
834  2007-04-15  0.856555      10
835  2007-04-16 -0.017391      10
836  2007-04-17  1.448577      10
837  2007-04-18  0.892200      10
838  2007-04-19 -0.229427      10
839  2007-04-20 -0.449668      10
840  2007-04-21  0.023372      11
841  2007-04-22  0.190210      11
842  2007-04-23 -0.881749      11
843  2007-04-24  0.841940      11
844  2007-04-25 -0.397363      11
845  2007-04-26 -0.423028      11
846  2007-04-27 -0.540688      11
847  2007-04-28  0.231017      11
848  2007-04-29 -0.692053      11
849  2007-04-30  0.134970      11
850  2007-05-01  2.766603      12
851  2007-05-02 -0.053609      12
852  2007-05-03 -0.434005      12
853  2007-05-04 -1.667689      12
854  2007-05-05  0.050222      12
855  2007-05-06 -1.109231      12
...
...
1070 2007-12-07  0.923065      34
1071 2007-12-08 -0.822000      34
1072 2007-12-09  1.607085      34
1073 2007-12-10  0.737825      34
1074 2007-12-11 -0.403760      34
1075 2007-12-12 -2.114548      34
1076 2007-12-13 -0.000311      34
1077 2007-12-14 -1.181809      34
1078 2007-12-15  0.299635      34
1079 2007-12-16  1.451169      34
1080 2007-12-17  0.160060      35
1081 2007-12-18 -0.178013      35
1082 2007-12-19  0.342205      35
1083 2007-12-20  0.285650      35
1084 2007-12-21 -2.362864      35
1085 2007-12-22  0.240937      35
1086 2007-12-23  0.620277      35
1087 2007-12-24 -0.259342      35
1088 2007-12-25  0.978559      35
1089 2007-12-26 -0.127675      35
1090 2007-12-27  0.766999      36
1091 2007-12-28  2.273105      36
1092 2007-12-29 -0.096391      36
1093 2007-12-30 -1.942132      36
1094 2007-12-31 -0.336592      36

@cwalvoort's solution is better

df['yr_interval'] = df.index.dayofyear // 10
davidbilla
  • 2,120
  • 1
  • 15
  • 26