2

I have a large (>32 M rows) Pandas dataframe. In column 'Time_Stamp' I have a Unix timestamp in seconds. These values are not linear, there are gaps, and some timestamps can be duplicated (ex: 1, 2, 4, 6, 6, 9,...). I would like to set column 'Result' of current row to the index of the row that is 60 seconds before current row (closest match if there are no rows exactly 60 seconds before current row, and if more than one match, take maximum of all matches).

I've tried this to first get the list of indexes, but it always return an empty list:

df.index[df['Time_Stamp'] <= df.Time_Stamp-60].tolist()

I cannot use a for loop due to the large number of rows.

Edit 20.01.2020: Based on comment below, I'm adding a sample dataset, and instead of returning the index I want to return the column Value:

In [2]: df
Out[2]: 
   Time_Stamp  Value
0  1  2.4
1  2  3.1
2  4  6.3
3  6  7.2
4  6  6.1
5  9  6.0
Hugues
  • 197
  • 1
  • 5
  • 19
  • 3
    Look into [`pd.merge_asof`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html). Keep the original and make a copy of the DataFrame. Subtract 60s from the 'Time Stamp' column of the copy, then an `asof` merge between the original and the copy will find the closest row that was 60s in the past (set a direction if you want closer or farther in the case of no exact match). If you want the max in the case of ties, then you should sort the copied dataframe by `['Timestamp', 'Result'], ascending=[False, False]` prior to the merge – ALollz Jan 18 '20 at 20:30
  • Excellent ! Merge takes about 5 seconds on 32M rows. Pandas Dataframes are impressive. How can I mark this as an answer ? Or should I answer it myself and refer to your comment ? – Hugues Jan 19 '20 at 19:58
  • super glad it solved your problem! The only reason I didn't provide a solution was because there was no sample data, and often without sample data and an expected output the an attempted solution falls short. If you want to create some sample data , just a few rows for instance, (see https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and add it to your problem then I'd be happy to fill out an answer. – ALollz Jan 19 '20 at 21:26
  • Also, if you 're going to take the time to improve the question, you can by all means answer your own question with what you've learned. I'm more than happy to upvote good questions and solutions :D. – ALollz Jan 19 '20 at 21:29
  • What is your expected output from the sample dataframe given? – Scott Boston Jan 20 '20 at 19:39
  • @ALollz you are a genius. thanks for this answer. – lb_so Apr 06 '21 at 04:21

1 Answers1

1

So with the precious help of ALollz, I managed to achieve what i wanted to do in the end, here's my code:

#make copy of dataframe
df2 = df[['Time_Stamp','Value']].copy()
#add Time_gap to Time_Stamp in df2
df2['Time_Stamp'] = df2.Time_Stamp +Time_gap
#sort df2 on Time_Stamp 
df2.sort_values(by = 'Time_Stamp', ascending=True,inplace = True)
df2 = df2.reset_index(drop=True)
df3 = pd.merge_asof(df, df2, on='Time_Stamp', direction='forward')
Hugues
  • 197
  • 1
  • 5
  • 19