Pandas Datframe1 search for match in range of Dataframe2

Question

In the first dataframe, the last two columns (shift_one and shift_two) can be thought of as a guess of a potential true coordinate. Call this df1.

df1:

         p_one     p_two            dist  shift_one  shift_two
0        Q8_CB      Q2_C  d_6.71823_Angs     26.821    179.513
1        Q8_CD      Q2_C  d_4.72003_Angs    179.799    179.514
....

In the second dataframe, call this df2, I have a dataframe of experimental observed coordinates which I denote peaks. It simply is just the coordinates and one more column that is for how intense the signal was, this just needs to be along for the ride.

df2:

            A        B        C
0      31.323   25.814   251106
1      26.822   26.083   690425
2      27.021   179.34  1409596
3      54.362   21.773  1413783
4      54.412   20.163   862750
....

I am aiming to have a method for each guess in df1 to be queried/searched/refrenced in df2, within a range of 0.300 of the initial guess in df1. I then want this to be returned in a new datframe, lets say df3. In this case, we notice there is a match in row 0 of df1 with row 2 of df2.

desired output, df3:

         p_one     p_two            dist  shift_one  shift_two  match  match1  match2  match_inten
0        Q8_CB      Q2_C  d_6.71823_Angs     26.821    179.513   TRUE  27.021  179.34      1409596
1        Q8_CD      Q2_C  d_4.72003_Angs    179.799    179.514    NaN     NaN     NaN          NaN
....

I have attempted a few things:

(1) O'Reily suggests dealing with bounds in a list in python by using lambda or def (p 78 of python in a nutshell). So I define a bound function like this.

def bounds (value, l=low, h=high)

I was then thinking that I could just add a new column, following the logic used here (https://stackoverflow.com/a/14717374/3767980).

df1['match'] = ((df2['A'] + 0.3 <= df1['shift_one']) or (df2['A'] + 0.3 => df1['shift_one'])

--I'm really struggling with this statement

Next I would just pull the values, which should be trivial.

(2) make new columns for the upper and lower limit, then run a conditional to see if the value is between the two columns.

Finally:

(a) Do you think I should stay in pandas? or should I move over to NumPy or SciPy or just traditional python arrays/lists. I was thinking that a regular python lists of lists too. I'm afraid of NumPy since I have text too, is NumPy exclusive to numbers/matrices only.

(b) Any help would be appreciated. I used biopython for phase_one and phase_two, pandas for phase_three, and I'm not quite sure for this final phase here what is the best library to use.

(c) It is probably fairly obvious that I'm an amateur programer.

"is NumPy exclusive to numbers/matrices only" -- no. record/structured arrays can have differently typed columns like a dataframe — Paul H, Jan 31 '15 at 00:12
I don't understand the format of the solution. Does each row in df1 only ever match at most 1 row in df2? — Alex, Jan 31 '15 at 08:26
Yes that is correct; df2 is massive, dramatically larger. I'm looking up a match where both numerical values are within a tolerance of 0.3 of their respective entry. Then a return of a match and a print of the two values that met both tolerances and their corresponding intensity. — PhysicalChemist, Jan 31 '15 at 08:40

score 1 · Accepted Answer · answered Jan 31 '15 at 18:33

1

The following assumes that the columns to compare have the same names.

def temp(row):
    index = df2[((row-df2).abs() < .3).all(axis=1)].index
    return df2.loc[index[0], :] if len(index) else [None]*df2.shape[1]

Eg.

df1 = pd.DataFrame([[1,2],[3,4], [5,6]], columns=["d1", "d2"])
df2 = pd.DataFrame([[1.1,1.9],[3.2,4.3]], columns=["d1", "d2"])
df1.apply(temp, axis=1)

produces

    d1   d2
0  1.1  1.9
1  3.2  4.3
2  NaN  NaN

answered Jan 31 '15 at 18:33

Alex

18,484
8
60
80

Thank you, this was very helpful and I got my script running. – PhysicalChemist Jan 31 '15 at 21:36

Pandas Datframe1 search for match in range of Dataframe2

1 Answers1