Consider the following problem. One data frame has a column of timestamps that we will represent as integers (we cannot necessarily use datetimes for this problem, so we will stick with integers). Another data frame has a column of timestamps corresponding to those in the first data frame, but the measurements are noisy: each timestamp from the first data frame appears in the second with an error of up to ±0.5 added to it.
We want to match up the timestamps, and if we cannot do this because the match is ambiguous, we should put np.nan.
import numpy as np
import pandas as pd

df1 = pd.DataFrame.from_dict({"time1": [1, 2, 3, 4]})
df2 = pd.DataFrame.from_dict({"time2": [1.2, 4.3, 2.3, 7.9, 2.9, 3.9]})
df3 = pd.DataFrame.from_dict({"time1": [1, 2, 3, 4], "time2": [1.2, 2.3, 2.9, np.nan]})
Here df1 is the exactly measured data frame, df2 contains measurements with noise errors of up to ±0.5, and df3 is my desired output. Note that it is ambiguous which timestamp in df2 the value 4 should match (it could be 3.9 or 4.3; we cannot tell which), so that entry should be np.nan.

Please note crucially that df2 might have more measurements than df1 does.
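For instance, a quick check on the example data (just to illustrate the ambiguity rule) shows two candidates within ±0.5 of the value 4:

time2 = np.array([1.2, 4.3, 2.3, 7.9, 2.9, 3.9])
candidates = time2[np.abs(time2 - 4) <= 0.5]
print(candidates)  # [4.3 3.9] -> two candidates, so the match for 4 is ambiguous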
I have currently tried a standard .apply call with a lambda function. It takes a timestamp in df1 and brute-force searches for a timestamp in df2 that is at most ±0.5 away (a simplified sketch of this is below). This works, but it is not fast enough for the number of measurements in the experiment I'm working with. I need the most efficient solution possible.