0

In the first dataframe, the last two columns (shift_one and shift_two) can be thought of as a guess of a potential true coordinate. Call this df1.

df1:

         p_one     p_two            dist  shift_one  shift_two
0        Q8_CB      Q2_C  d_6.71823_Angs     26.821    179.513
1        Q8_CD      Q2_C  d_4.72003_Angs    179.799    179.514
....

In the second dataframe, call this df2, I have a dataframe of experimental observed coordinates which I denote peaks. It simply is just the coordinates and one more column that is for how intense the signal was, this just needs to be along for the ride.

df2:

            A        B        C
0      31.323   25.814   251106
1      26.822   26.083   690425
2      27.021   179.34  1409596
3      54.362   21.773  1413783
4      54.412   20.163   862750
....

I am aiming to have a method for each guess in df1 to be queried/searched/refrenced in df2, within a range of 0.300 of the initial guess in df1. I then want this to be returned in a new datframe, lets say df3. In this case, we notice there is a match in row 0 of df1 with row 2 of df2.

desired output, df3:

         p_one     p_two            dist  shift_one  shift_two  match  match1  match2  match_inten
0        Q8_CB      Q2_C  d_6.71823_Angs     26.821    179.513   TRUE  27.021  179.34      1409596
1        Q8_CD      Q2_C  d_4.72003_Angs    179.799    179.514    NaN     NaN     NaN          NaN
....

I have attempted a few things:

(1) O'Reily suggests dealing with bounds in a list in python by using lambda or def (p 78 of python in a nutshell). So I define a bound function like this.

def bounds (value, l=low, h=high)

I was then thinking that I could just add a new column, following the logic used here (https://stackoverflow.com/a/14717374/3767980).

df1['match'] = ((df2['A'] + 0.3 <= df1['shift_one']) or (df2['A'] + 0.3 => df1['shift_one'])

--I'm really struggling with this statement

Next I would just pull the values, which should be trivial.

(2) make new columns for the upper and lower limit, then run a conditional to see if the value is between the two columns.

Finally:

(a) Do you think I should stay in pandas? or should I move over to NumPy or SciPy or just traditional python arrays/lists. I was thinking that a regular python lists of lists too. I'm afraid of NumPy since I have text too, is NumPy exclusive to numbers/matrices only.

(b) Any help would be appreciated. I used biopython for phase_one and phase_two, pandas for phase_three, and I'm not quite sure for this final phase here what is the best library to use.

(c) It is probably fairly obvious that I'm an amateur programer.

Community
  • 1
  • 1
PhysicalChemist
  • 540
  • 4
  • 14
  • 1
    "is NumPy exclusive to numbers/matrices only" -- no. record/structured arrays can have differently typed columns like a dataframe – Paul H Jan 31 '15 at 00:12
  • I don't understand the format of the solution. Does each row in df1 only ever match at most 1 row in df2? – Alex Jan 31 '15 at 08:26
  • Yes that is correct; df2 is massive, dramatically larger. I'm looking up a match where both numerical values are within a tolerance of 0.3 of their respective entry. Then a return of a match and a print of the two values that met both tolerances and their corresponding intensity. – PhysicalChemist Jan 31 '15 at 08:40

1 Answers1

1

The following assumes that the columns to compare have the same names.

def temp(row):
    index = df2[((row-df2).abs() < .3).all(axis=1)].index
    return df2.loc[index[0], :] if len(index) else [None]*df2.shape[1]

Eg.

df1 = pd.DataFrame([[1,2],[3,4], [5,6]], columns=["d1", "d2"])
df2 = pd.DataFrame([[1.1,1.9],[3.2,4.3]], columns=["d1", "d2"])
df1.apply(temp, axis=1)

produces

    d1   d2
0  1.1  1.9
1  3.2  4.3
2  NaN  NaN
Alex
  • 18,484
  • 8
  • 60
  • 80