I have a dataframe that contains GPS locations of vehicles received at various times in a day. For each vehicle, I want to resample the data hourly so that I have the median report (according to the timestamp) for each hour of the day. For hours with no corresponding rows, I want a blank row. I am using the following code:

idx = list(range(24))  # hours 0 to 23
for j in df.id.unique():
    data = df.loc[df.id == j].copy()  # copy to avoid SettingWithCopyWarning
    data['hour'] = data['timestamp'].dt.hour  # the .dt accessor is needed on a Series
    data_grouped = data.groupby(['imo', 'hour']).median().reset_index()
    data = data_grouped.set_index('hour').reindex(idx).reset_index()


Since my dataframe has millions of ids, iterating through all of them takes a long time. Is there a more efficient way of doing this?

Unlike the question "Pandas reindex dates in Groupby", I have multiple rows for each hour, and some hours have no rows at all.

M_K_

1 Answer

Tested in the latest version of pandas: convert the hour column to a categorical with all 24 possible categories, then aggregate without a loop:

df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))
# observed=False keeps the hour categories that have no rows in the result
df1 = df.groupby(['id', 'imo', 'hour'], observed=False).median().reset_index()
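
As a possible alternative, here is a minimal sketch that aggregates once and then reindexes against every (id, hour) pair, which avoids building a combination for every pairing of `id` and `imo`. The sample data and the `lat`/`lon` column names are assumptions made for illustration, and the `imo` column is left out on the assumption that it does not vary within an `id`.

```python
import pandas as pd

# Hypothetical sample data: two vehicles with a handful of reports.
df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2021-01-01 00:10",
        "2021-01-01 00:50",
        "2021-01-01 03:15",
        "2021-01-01 05:30",
    ]),
    "lat": [10.0, 12.0, 20.0, 30.0],
    "lon": [100.0, 102.0, 110.0, 120.0],
})

# Hour of day for each report, used directly as a grouping key.
hours = df["timestamp"].dt.hour.rename("hour")

# One vectorised aggregation over all vehicles, no Python loop.
medians = df.groupby(["id", hours])[["lat", "lon"]].median()

# Every (id, hour) combination, including hours with no reports.
full_index = pd.MultiIndex.from_product(
    [df["id"].unique(), range(24)], names=["id", "hour"]
)

# Missing hours become rows filled with NaN.
result = medians.reindex(full_index).reset_index()
```

Here `result` has 24 rows per vehicle, and hours without any reports hold NaN in the aggregated columns.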
jezrael