I have two data frames.
Data Frame #1 (df1):
1 object_name time_epoch_ms source data
2 a 1538518822490 source_1 some_data
3 b 1538528822490 source_2 some_data
4 b 1538538822490 source_2 some_data
5 b 1538548822490 source_3 some_data
6 b 1538558822490 source_1 some_data
7 c 1538568822490 source_2 some_data
8 c 1538578822490 source_2 some_data
9 c 1538588822490 source_2 some_data
... etc etc
Data Frame #2(df2):
1 object_name time_epoch_ms new_data
2 a 1538518722490 x
3 b 1538528822490 y
4 b 1538518922490 z
5 b 1538519922490 a
6 b 1538598822490 b
7 c 1538548822490 c
8 c 1538538822490 c
9 c 1538528822490 d
... etc etc
Entries in these two tables give information about objects with object_names for different points in time.
I would like to combine these two tables such that Data Frame #2 is augmented with the source the object_name was using at a particular time specified in Data Frame #1.
Problem: Timestamps in the two tables do not exactly match. Some timestamps that exist in table 2 do not exist in table 1. However, they should roughly correspond to the hour. Some timestamps in Table 1 are missing, so the "last available data" is best.
Is there a way to execute the following steps using pandas' merge ability?
- Match data from dataframes such that df2 gets "source" data from df1 based on object_name and time_stamp such that time_stamp matches to the closest hour.
- If data for a particular hour is in df2, but not df1, then source is retried from the last available hour of data, so it is "filled" based on whatever data exists for that object.
- If object in df2 is not in df1, then it gets a "null"
My approach:
I currently do .apply to get each row from df2 and find all the timestamps for that object in df1, if they exist. Then, I return the closest match or null.
I am wondering if there is a more elegant way to do this using pandas' merge or concat functionality, but I am having trouble understanding how to use them in this case and how to handle filling data in and matching based to the hour (without doing separate pre-processing to get an hour column in there).