0

I'm trying to do interpolation on a pandas DataFrame that contains time series data. I have hourly data for temp, and I want to interpolate the temp values at the half-hourly points. This way, I estimate temp for every trading period for each day, ie. 24h per day so 48 trading periods per day.

My MWE is

import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta
import pyarrow as pa
import pyarrow.parquet as pq

# my dataset
df = pd.DataFrame()
d1 = '2020-10-21'
d2 = '2020-10-22'
df['date'] = pd.to_datetime([d1]*24+[d2]*24, format='%Y-%m-%d')
df['time'] = pd.date_range(d1, periods=len(df), freq='H').time
df['temp'] = pd.DataFrame((50+20*np.sin(np.linspace(0,0.91*np.pi,len(df))))).values

# combine time and date
df.loc[:,'datetime'] = pd.to_datetime(df.date.astype(str)+' '+df.time.astype(str))
df = df.drop(['date','time'], axis=1)
df = df.set_index('datetime')

# trading period
df['tp'] = pd.DataFrame(df.index.hour.values*2+1).values

# interpolate to find temp and datetime for trading periods 2,4,6,...
for n in df.tp.values:
    df.loc[-1,'tp'] = n+1
    df = df.sort_values('tp').reset_index(drop=True)

#df = df.interpolate(method='linear')

print(df.head(10))

I'm adapting the answer in this post, but I get the error TypeError: value should be a 'Timestamp' or 'NaT'. Got 'int' instead. I suspect it's due to the df.loc[-1,'tp'] = n+1 line but not sure how to fix it.

Medulla Oblongata
  • 3,771
  • 8
  • 36
  • 75

2 Answers2

1

Try:

df = df.resample('30T').mean().interpolate()
df['tp'] = ((df.index.hour * 60 + df.index.minute) / 30 + 1).astype(int)
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • Thanks, your method mostly works. I noticed that `tp` goes `45`, `46`, `47`, `24`, `1`, `2`, `3`, ... so I'd prefer if that `24` is `48` instead. – Medulla Oblongata Jul 19 '21 at 22:04
  • Awesome. Is there a more general way of doing it though? The `df['tp'] = 1` line implies that the data starts at midnight `00:00:00`, which isn't always the case. – Medulla Oblongata Jul 19 '21 at 22:21
  • @MedullaOblongata, can you check my new answer, please? I hope it's right now. – Corralien Jul 19 '21 at 22:31
0

Try asfreq and then interpolate:

In [36]: df.asfreq('30T').interpolate()
Out[36]:
                          temp    tp
datetime
2020-10-21 00:00:00  50.000000   1.0
2020-10-21 00:30:00  50.607891   2.0
2020-10-21 01:00:00  51.215782   3.0
2020-10-21 01:30:00  51.821424   4.0
2020-10-21 02:00:00  52.427066   5.0
...                        ...   ...
2020-10-22 21:00:00  57.869280  43.0
2020-10-22 21:30:00  57.303145  44.0
2020-10-22 22:00:00  56.737010  45.0
2020-10-22 22:30:00  56.158416  46.0
2020-10-22 23:00:00  55.579822  47.0

[95 rows x 2 columns]
dicristina
  • 335
  • 2
  • 13