I have 2 data frames
print(df1)
Name df1 RT [min] Molecular Weight RT [min]+0.2 RT [min]-0.2 Molecular Weight + 0.2 Molecular Weight - 0.2
0 unknow compound 1 7.590 194.04212 7.790 7.390 194.24212 193.84212
1 unknow compound 2 7.510 194.15000 7.710 7.310 194.35000 193.95000
2 unknow compound 3 7.410 194.04209 7.610 7.210 194.24209 193.84209
3 unknow compound 4 7.434 342.11615 7.634 7.234 342.31615 341.91615
4 unknow compound 5 0.756 176.03128 0.956 0.556 176.23128 175.83128
and
print(df2)
Name df2 Molecular Weight RT [min]
0 β-D-Glucopyranuronic acid 194.04220 7.483
1 α,α-Trehalose 194.10000 7.350
2 Threonylserine 206.08970 8.258
3 Terephthalic acid 166.02595 7.465
4 Sulfuric acid 97.96714 8.909
I want to merge the rows from df2 to those in df1 if 2 conditions are met.
- RT [min] value of df2 is in the intervale of RT [min]-0.2 and RT [min]+0.2 from df1
- AND Molecular Weigh the value of df2 is in the intervale of Molecular Weight - 0.2 and Molecular Weight + 0.2 from df1.
This means that if a row from df2 meets the conditions of two other rows from df1, the row of df1 will be duplicated.
so df3 should look
print(df3)
Name df1 RT [min]+0.2 RT [min]-0.2 Molecular Weight + 0.2 Molecular Weight - 0.2 Name df2 Molecular Weight RT [min]
0 unknow compound 1 7.790 7.390 194.24212 193.84212 β-D-Glucopyranuronic acid 194.0422 7.483
1 unknow compound 1 7.790 7.390 194.24212 193.84212 α,α-Trehalose 194.1000 7.350
2 unknow compound 2 7.710 7.310 194.35000 193.95000 β-D-Glucopyranuronic acid 194.0422 7.483
3 unknow compound 3 8.310 7.910 206.30000 205.90000 Threonylserine 206.0897 8.258
4 unknow compound 4 7.634 7.234 342.31615 341.91615 NaN NaN NaN
5 unknow compound 5 0.956 0.556 176.23128 175.83128 NaN NaN NaN
the first row in df2 meets the 2 conditions of the unknow compound 1 and unknow compound 2 from df1 so we have it twice in df3.
the second row in df2 meets the 2 conditions of the unknow compound 1 only.
the third row in df2 meets the 2 conditions of the unknow compound 3 only.
all the other rows do not meet any of the conditions from df1.
I tried to do it based on How to join two data frames for which column values are within a certain range? the first answer
import pandas as pd
df_1 = pd.read_excel (r'D:\CD SandBox\df1.xlsx')
df_2 = pd.read_excel (r'D:\CD SandBox\df2.xlsx')
df2.index = pd.IntervalIndex.from_arrays(df2['RT [min]-0.2'],df2['RT [min]+0.2'],closed='both')
df2['RT [min]'] = df2['RT [min]'].apply( lambda x : df2.iloc[df1.index.get_loc(x)])
But do not know what to do with the second line of code and receive this error:
df2['RT [min]'] = df2['RT [min]'].apply( lambda x : df2.iloc[df1.index.get_loc(x)])
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\BCDD\Anaconda3\envs\PTSD\lib\site-packages\pandas\core\series.py", line 4213, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\_libs\lib.pyx", line 2403, in pandas._libs.lib.map_infer
File "<input>", line 1, in <lambda>
File "C:\Users\BCDD\Anaconda3\envs\PTSD\lib\site-packages\pandas\core\indexes\interval.py", line 730, in get_loc
raise KeyError(key)
KeyError: 8.258
EDIT: Trying using merge_asof
according to How to join two DataFrames with multiple overlapping timestamps using an extra shared variable
df2 = df2.drop(['RT [min]', 'Molecular Weight'], axis=1)
df2['RT [min]']=df2['RT [min]-0.2']
pd.merge_asof(df2[['RT [min]','Name df2']] , df1,on='RT [min]',direction ='forward',allow_exact_matches =True)
...
RT [min] Name df2 Name df1 Molecular Weight
0 0.556 unknow compound 5 α,α-Trehalose 194.10000
1 7.210 unknow compound 3 α,α-Trehalose 194.10000
2 7.234 unknow compound 4 α,α-Trehalose 194.10000
3 7.310 unknow compound 2 α,α-Trehalose 194.10000
4 7.390 unknow compound 1 Terephthalic acid 166.02595
gives wrong matching for the table.
Any idea\ hint will be appreciated