0

I have some hierarchical data from 2003 to 2011 which bottoms out into time series data which looks something like this:

polar_temp
         Station_Number      Date  Value
417         CA002100805  20030101   -296
423         CA002202570  20030101   -269
425         CA002203058  20030101   -268
427         CA002300551  20030101    -23
428         CA002300902  20030101   -200

I set a multi index on Station_Number and Date:

polar_temp['Date'] = pd.to_datetime(polar_temp['Date'],
                     format='%Y%m%d')#.dt.strftime("%Y-%m-%d")
polar_temp = polar_temp.set_index(['Station_Number', "Date"])

                           Value
Station_Number Date             
CA002100805    2003-01-01   -296
CA002202570    2003-01-01   -269
CA002203058    2003-01-01   -268
CA002300551    2003-01-01    -23
CA002300902    2003-01-01   -200

Now I would like to perform a resampling of the data by calculating the mean of Value for every 8 days by using:

polar_temp8d = polar_temp.groupby([pd.Grouper(level='Station_Number'),
                                    pd.Grouper(level='Date', freq='8D')]).mean()

                                Value
Station_Number Date                  
CA002100805    2003-01-01 -300.285714
               2003-01-09 -328.750000
               2003-01-17 -325.500000
               2003-01-25 -385.833333
               2003-02-02 -194.428571
...                               ...
USW00027515    2005-06-23   76.625000
               2005-07-01   42.375000
               2005-07-09   94.500000
               2005-07-17   66.500000
               2005-07-25   56.285714

So the problem here is that pandas only resamples the years from 2003 until 2005, so the years from 2006 to 2011 are completely left out. Now my question is: Did I use the Grouper function to analyse time series data correctly or is the anything else I have missed?

Edit 1:

By running:

print(polar_temp.loc['CA002300902'].sort_index(ascending=False))

            Value
Date             
2011-12-31   -288
2011-12-30   -299
2011-12-29   -347
2011-12-28   -310
2011-12-27   -239

One can see that the stations before the resampling have data until 2011.

tillwss
  • 25
  • 5
  • Do certain stations not have data after 2005? I believe the aggregation will only include periods where there is data for a station. – Ian Wright Jun 03 '21 at 14:17
  • But the aggregation should still run through for stations that have data from 2003 to 2011, because as far as I understand it pandas runs the resampling station wise. – tillwss Jun 03 '21 at 14:22
  • Can you double check that by running `df.loc[STATION_WITH_LATER_DATA'].sort_index(ascending=False)`. Additionally, I don't think this will make a difference, but `polar_temp.groupby(['Station Number', pd.Grouper(level='Date', freq='8D')])` is sufficient for the grouping operation. – Ian Wright Jun 03 '21 at 14:30
  • Thanks for asking me double check, because for some stations the resampling has worked for the whole time period, however the row count went down to around 60.000 whilst it should be around 300.000 – tillwss Jun 03 '21 at 14:47
  • I have checked again and even there are data gaps for several years which can't be true. – tillwss Jun 03 '21 at 15:23

1 Answers1

1

I have created synthetic data to test your approach and it worked fine. I then arbitrarily removed data points to see if the aggregation would fail with missing dates and it skips missing values from the time series, as displayed on the output immediately below. Therefore, I still don't understand why your output stops in 2005.

Output without resampling and interpolation:

                                Value
Station_Number Date                  
CA002100805    2003-01-02 -195.545455
               2003-01-10 -144.963636
               2003-01-18 -158.045455
               2003-01-26 -151.533333
               2003-02-03 -196.300000
               2003-04-08 -159.963636
               2003-04-16 -157.115385
               2003-04-24 -150.191489
               2003-05-02 -146.113924
               2003-05-10 -133.367347

Notice how it skips data points in March 2003 altogether.

You might sort your problem by: 1. Adding missing dates to the DataFrame 2. Filling NAs with interpolate()

import pandas as pd
import numpy as np

# Sets random seed
np.random.seed(42)

# Sample size
size=10**5

station_numbers = ['CA002100805', 'CA002202570', 'CA002203058', 'CA002300551',
                   'CA002300902']

stations = [station_numbers[i] for i in
            np.random.randint(low=0, high=len(station_numbers), size=size)]

values = np.random.randint(low=-400, high=100, size=size)

dates_list = pd.date_range(start='2003-01-01', end='2011-12-31')

###################################
#### TESTS with missing dates #####
###################################

# Removes dates from dates_list to test
percent_to_remove = 1/3
items_to_remove = len(dates_list) * percent_to_remove

# Index of items to remove
rem_idx = set()
while len(rem_idx) < items_to_remove:
    # Thanks to Jon Kiparsky's answer on this thread
    # https://stackoverflow.com/questions/28037158/how-to-not-repeat-randint-value
    rem_idx.add(np.random.randint(0, len(dates_list)))

dates_list = dates_list.delete(list(rem_idx))

# Arbitratily removes dates in sequence to test
dates_list = dates_list.delete(range(20, 60))

###################################
###################################

dates = [dates_list[i] for i in
         np.random.randint(low=0, high=len(dates_list), size=size)]

# Creates DataFrame
data = (pd.DataFrame({'Station_Number': stations,
                     'Date': dates,
                     'Value': values})
        .set_index('Date')
        .sort_index())

# Creates one row per day
data = data.groupby('Station_Number').resample('D').mean()

# Fills NAs with standard interpolation strategy
data = data.interpolate()

# Calculates 8-day mean value
eight_day_mean = data.groupby([pd.Grouper(level='Station_Number'),
                               pd.Grouper(level='Date', freq='8D')]).mean()

Output with resampling and interpolation:

                                Value
Station_Number Date                  
CA002100805    2003-01-02 -178.138024
               2003-01-10 -135.644524
               2003-01-18 -147.253977
               2003-01-26 -147.694712
               2003-02-03 -200.642180
               2003-02-11 -203.057708
               2003-02-19 -192.821042
               2003-02-27 -182.584375
               2003-03-07 -172.347708
               2003-03-15 -162.111042
               2003-03-23 -151.874375
               2003-03-31 -141.637708
               2003-04-08 -154.028469
               2003-04-16 -151.099405
               2003-04-24 -156.152083

Now notice how it contains data points for March 2003 that are somewhere in between the values from February and April 2003, due to the interpolation strategy adopted.

  • That is a very good approach, however unfortunately I am not allowed to interpolate the data. I thought about adding the missing dates too, but couldn't come up with a prober idea how. – tillwss Jun 07 '21 at 08:04
  • You can skip the interpolation and the final data set will show NAs where there is no data. Let me know if it works. If it does, I would appreciate if you accepted the answer and upvoted it, please. – Paulo Schau Guerra Jun 07 '21 at 09:10
  • I will, thank you for your help already! I am just struggling with adding the missing dates to the df. My approach was to create an empty df only with the dates of the time period with the df having the station data, to sort of fill the data gaps with NaN. I used: `for station, gp in polar_temp.groupby('Station_Number'): station_merge = pd.merge(gp, df, on=['Date'], how='outer') polar_m = pd.concat([station_merge, polar_merge], axis=0) polar_m['Station_Number'] = station` – tillwss Jun 07 '21 at 09:53
  • Why not use the resample method that I've posted above? `data = data.groupby('Station_Number').resample('D').mean()` – Paulo Schau Guerra Jun 07 '21 at 10:08
  • I am not quite sure, how to pass the values per station into your script. Sorry for all the questions, as you can guess I am quite new to Python – tillwss Jun 07 '21 at 10:20
  • No problem! You can try `polar_temp = polar_temp.groupby('Station_Number').resample('D').mean()` Let me know if it works. If it doesn't work, paste the error here. – Paulo Schau Guerra Jun 07 '21 at 10:32
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/233419/discussion-between-paulo-schau-guerra-and-tillwss). – Paulo Schau Guerra Jun 07 '21 at 10:34