4

I have a time series dataframe with dates|weather information that looks like this:

enter image description here

2017-01-01 5
2017-01-02 10
.
.
2017-12-31 6

I am trying to upsample it to hourly data using the following: weather.resample('H').pad()

I expected to see 8760 entries for 24 intervals * 365 days. However, it only returns 8737 with the last 23 intervals missing for 31st of december. Is there something special I need to do to get 24 intervals for the last day?

Thanks in advance.

Umar.H
  • 22,559
  • 7
  • 39
  • 74
Gopakumar G
  • 106
  • 3
  • 2
    Hi There, welcome to SO, great first question. You're very close to having an [mcve] I would suggest providing some code to recreate your data frame or paste it in as text. – Umar.H Aug 29 '20 at 16:35
  • 1
    Prepare a valid value for 2018-01-01 at the end of the data and call resample. If you delete the data of 2018-01-01 after resample, there will be 8760 data. – kunif Aug 29 '20 at 16:41
  • @kunif you can post it as an answer – Willy satrio nugroho Aug 30 '20 at 03:25

2 Answers2

2

Pandas normalizes 2017-12-31 to 2017-12-31 00:00 and then creates a range that ends in that last datetime... I would include a last row before resampling with

df.loc['2018-01-01'] = 0

Edit: You can get the result you want with numpy.repeat

Take this df

np.random.seed(1)
weather = pd.DataFrame(index=pd.date_range('2017-01-01', '2017-12-31'),
    data={'WEATHER_MAX': np.random.random(365)*15})

            WEATHER_MAX
2017-01-01     6.255330
2017-01-02    10.804867
2017-01-03     0.001716
2017-01-04     4.534989
2017-01-05     2.201338
...                 ...
2017-12-27     4.503725
2017-12-28     2.145087
2017-12-29    13.519627
2017-12-30     8.123391
2017-12-31    14.621106

[365 rows x 1 columns]

By repeating on axis=1 you can then transform the default range(24) column names to hourly timediffs

# repeat, then stack
hourly = pd.DataFrame(np.repeat(weather.values, 24, axis=1),
    index=weather.index).stack()

# combine date and hour
hourly.index = (
    hourly.index.get_level_values(0) +
    pd.to_timedelta(hourly.index.get_level_values(1), unit='h')
)
hourly = hourly.rename('WEATHER_MAX').to_frame()

Output

                     WEATHER_MAX
2017-01-01 00:00:00     6.255330
2017-01-01 01:00:00     6.255330
2017-01-01 02:00:00     6.255330
2017-01-01 03:00:00     6.255330
2017-01-01 04:00:00     6.255330
...                          ...
2017-12-31 19:00:00    14.621106
2017-12-31 20:00:00    14.621106
2017-12-31 21:00:00    14.621106
2017-12-31 22:00:00    14.621106
2017-12-31 23:00:00    14.621106

[8760 rows x 1 columns]
RichieV
  • 5,103
  • 2
  • 11
  • 24
  • Thanks for helping me understand the issue. I was able to fix it with the suggestions above although I was expecting an in-built functionality within resample to get what I needed. – Gopakumar G Aug 30 '20 at 06:37
  • I understand, if you think about it, this should be the expected behavior, so when you have datetime instead of just date (e.g. `2017-12-31 15:45`), the resample will not go past that last datetime – RichieV Aug 30 '20 at 07:06
  • May I ask what you will be using this resample for? perhaps you can avoid resampling and take advantage of numpy's broadcasting – RichieV Aug 30 '20 at 07:07
  • 1
    Trying to preprocess and combine energy consumption data that is hourly with weather data given in daily format. I am trying to resample the weather data to hourly so I have hourly data across the board. I did not encounter the issue with consumption data since it was already hourly. – Gopakumar G Aug 30 '20 at 07:24
0

What to do and the reason are the same as @RichieV's answer.
However, the value to be used is not 0 or a meaningless value, it is necessary to use valid data actually measured on 2018-01-01.

This is because using a meaningless value reduces the effectiveness of the resampled 2017-12-31 data and the results derived using that data.

  1. Prepare a valid value for 2018-01-01 at the end of the data.
  2. Call resample.
  3. Delete the data of 2018-01-01 after resample.
  4. You will get 8670 data for 2017.

Look at @RichieV's modified answer:

I was misunderstanding the question.
My answer was to complement resample with interpolate etc.
resampleを用いた外挿 (データ補間) を行いたい

If the same value as 00:00 on the day is all right, it would be a different way of thinking.

kunif
  • 4,060
  • 2
  • 10
  • 30