0

I'm trying to create a data frame from a start date and end date, for a number of asset_id's and turn it into a list of half-hours for each asset_id between the start and end date with the values of some_property filled forward.

I've tried Grouped and resample from the documentation and examples from SO but am stumped how to get this done.

Consider example:

some_time = datetime(2018,4,2,20,20,42)
start_date = datetime(some_time.year,some_time.month,some_time.day).astimezone(pytz.timezone('Europe/London'))
end_date = start_date + timedelta(days=1)
start_date = start_date + timedelta(hours=some_time.hour,minutes=(0 if some_time.minute < 30 else 30 ))
df = pd.DataFrame(['A','B'],columns=['asset_id'])
df2=df.copy()
df['HH'] = start_date
df2['HH'] = end_date
df['some_property']=0
df.loc[df['asset_id']=='B','some_property']=2
df = df.append(df2)

print(df)

Example df:

                          HH asset_id  some_property
0  2018-04-02 20:00:00+01:00        A            0.0
1  2018-04-02 20:00:00+01:00        B            2.0
0  2018-04-03 00:00:00+01:00        A            NaN
1  2018-04-03 00:00:00+01:00        B            NaN

My desired df is this:

                           HH asset_id  some_property
0   2018-04-02 20:00:00+01:00        A              0
4   2018-04-02 20:30:00+01:00        A              0
6   2018-04-02 21:00:00+01:00        A              0
...
2   2018-04-03 00:00:00+01:00        A              0
1   2018-04-02 20:00:00+01:00        B              2
5   2018-04-02 20:30:00+01:00        B              2
...
3   2018-04-03 00:00:00+01:00        B              2

example by dumb for-loop as below.

for ii in range(1,((24-some_time.hour)*2)-1):
    temp_time = start_date + timedelta(minutes=30*ii)
    for jj in ['A','B']:
        df2 = df.append({'asset_id':jj,'HH':temp_time,'some_property':(0 if jj=='A' else 2)},ignore_index=True)
        df = df2
df.loc[2,'some_property']=0
df.loc[3,'some_property']=2

Can anyone help? I'm trying to create dummy data and having trouble. In reality my dummy data will have multiple rows per asset_id before resampling and ffill-ing.

Update

This might be on the right track:

df2 = df.set_index('HH').groupby('asset_id').resample('30T').ffill().drop('asset_id',axis=1)

gives me:

                                   some_property
asset_id HH                                      
A        2018-04-02 20:00:00+01:00            0.0
         2018-04-02 20:30:00+01:00            0.0
         2018-04-02 21:00:00+01:00            0.0
...
         2018-04-03 00:00:00+01:00            NaN
B        2018-04-02 20:00:00+01:00            2.0
         2018-04-02 20:30:00+01:00            2.0
...
         2018-04-03 00:00:00+01:00            NaN

I'm happy to have gained a multindex and can set the last datetime manually (or can exclude row actually).

Does this look like a sensible method? If so, question answered!

Jim O
  • 136
  • 1
  • 6
  • resample + ffill – BENY Apr 02 '18 at 14:50
  • How many duff attempts of using resample and ffill shall should I show you I've tried? :D I'll try and find the link to the example that got me closest.... – Jim O Apr 02 '18 at 14:56
  • Is that possible create your df with pd.date_range ? – BENY Apr 02 '18 at 15:00
  • I could do but I'm still adding arbitrary (to you!) times with different values for "some_property" before ffillling....or maybe I should add them after!? I could add them after for this example but ultimately I will be wanting to turn sparse (datetimes) matrices I'm pulling from somewhere else into-half-hourly dfs. – Jim O Apr 02 '18 at 15:03
  • This looks like it comes close but is for interpolate not ffill and seemed a dead end to me. https://stackoverflow.com/questions/37057187/pandas-interpolate-within-a-groupby – Jim O Apr 02 '18 at 15:05

1 Answers1

0

This isn't quite what my example answer gave but it will do!

df2 = df.set_index('HH').groupby('asset_id').resample('30T').ffill().drop('asset_id',axis=1)

(Got close to this before but typo meant I thought the approach didn't work!)

Jim O
  • 136
  • 1
  • 6