0

I have two dataframe , showing below :

A Dataframe(about 50k row):

ID Timestamp
1 2021/4/28 01:00:00
2 2021/4/27 01:03:02
1 2021/4/28 02:05:01
... ...

And B Dataframe(about 16M row):

ID Timestamp Activities
1 2021/4/28 00:59:58 30
2 2021/4/27 01:02:58 27
1 2021/4/28 02:04:07 44
1 2021/4/28 02:04:08 45
... ... ...

I need to find out the Activities value in B dataframe which has the same ID and has the closest Timestamp in A dataframe , and incert to A Dataframe's new column.

Now I'm trying solving this question by simple loop , it works , but it's too slow. Here is the pseudo code :

  • Iterate each row from A dataframe and get ID and Timestamp
  • filter B by ID and make sure B Timestamp in B is less than or equal to A timestamp, sort decendingly by B's timestamp and get the 1st Activities value
  • insert to A's new column in current row
  • drop B's row which I have just used
  • go forward to next row

But as I mentioned , this is too slow, how can I speed up this procedure ?

Edison Wu
  • 47
  • 2
  • Could you please add a minimal reproductive example ? See here for details https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. Regarding your problem, you should join both dataframes on ID and then search for closest values using window analytics, other approach are also possible - i.e using data.table rolling joins. If you provide a reproductive example, we can proceed further. – Pawel Stradowski Apr 28 '21 at 10:01
  • Not entirely sure if compuationally more efficient, but I guess it is, so I would first filter data frame B on IDs that are also in data frame A. This way, your loop (even if you do it row by row) would search one ID at a time in a ~50k data frame instead of a 16m data frame. – deschen Apr 28 '21 at 10:55

0 Answers0