0

Consider the following problem. A data frame has a column of timestamps that we will represent as integers (we cannot necessarily use date time for this problem, so we will stick with integers). Another data frame has a column of timestamps which are matched to the first data frame, but there was some error in the measurements, meaning that measurements from the first data frame have +5 or -5 added to them in the second data frame.

We want to match up the timestamps, and if we cannot do this because it is ambiguous, we should put nan.

df1 = pd.DataFrame.from_dict({"time1" : [1,2,3,4]})
df2 = pd.DataFrame.from_dict({"time2" : [1.2,4.3,2.3,7.9,2.9, 3.9]})
df3 = pd.DataFrame.from_dict({"time1" : [1,2,3,4], "time2" : [1.2, 2.3, 2.9, np.nan]})

Here df1 is the exactly measured data frame, df2 contains measurements with noise errors of +-0.5, and df3 is my desired output. Note that it is ambiguous as to the timestamp in df2 that 4 should match to (it could be 3.9 or 4.3, we cannot tell which), so there should be np.nan in that column.

Please note crucially that df2 might have more measurements than df1 does.

I have currently tried using a standard .apply statement with a lambda function. This takes a timestamp in df1, and then it brute force searches for a timestamp in df2 which is at most +-0.5 away from the timestamp in df1. This works, but it is not fast enough for the amount of measurements I have in the experiment I'm working with. I need the most efficient solution possible.

Maria C
  • 23
  • 4
  • Does this answer your question? [How to join two dataframes for which column values are within a certain range?](https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range) – Mark Jul 18 '23 at 11:03
  • no, thanks for checking @Mark – Maria C Jul 18 '23 at 12:45

1 Answers1

1

Try the merge_asof function, note you will have to sort the data frames beforehand, let me know if this doesn't make sense:

df1 = df1.sort_values(by='time1')
df2 = df2.sort_values(by='time2')

# merge_asof with a tolerance of 0.5
merged = pd.merge_asof(df1, df2, left_on='time1', right_on='time2', direction='nearest', tolerance=0.5)

# replace the non-matched values with NaN
merged['time2'] = np.where(merged['time2'].isnull(), np.nan, merged['time2'])

fnqwejflqo
  • 21
  • 1
  • This does not work, it gives an error to do with datatypes? `Function call with ambiguous argument types`. Note in my problem the time1, time2 columns are int and the tolerance is 100. I cannot convert these columns to datetimes, they need to be int. – Maria C Jul 18 '23 at 11:14