Iterate elements of pandas column over elements of another column from a different data frame of unequal length

Question

I have two pandas data frames of unequal length. The first contains data about predicted protein modifications; the second holds data for experimentally verified protein modifications.

The first data frame contains the following columns:

protein_id
position_predicted
modification_predicted

… and looks something like this:

protein_id	position_predicted	modification_predicted
prot1	135	+
prot1	267	+
prot1	360	-
prot2	59	++
prot2	135	+++
prot3	308	-
…	…	…

The second data frame contains columns with experimentally verified protein modification positions:

protein_id
position

… and looks like so:

protein_id	position
prot1	135
prot3	300
prot4	55
…	…
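
A minimal sketch for rebuilding the rows shown in the two tables, so the example can be run. The elided rows (…) are left out, and the names df_pred and df_real are an assumption chosen to match the answer below.

```python
import pandas as pd

# Predictions: only the six rows shown in the first table.
df_pred = pd.DataFrame({
    "protein_id": ["prot1", "prot1", "prot1", "prot2", "prot2", "prot3"],
    "position_predicted": [135, 267, 360, 59, 135, 308],
    "modification_predicted": ["+", "+", "-", "++", "+++", "-"],
})

# Experimentally verified positions: only the three rows shown in the second table.
df_real = pd.DataFrame({
    "protein_id": ["prot1", "prot3", "prot4"],
    "position": [135, 300, 55],
})

print(df_pred.shape, df_real.shape)
```

The two frames deliberately have different lengths, and prot4 appears only in the experimental frame.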

protein_id in both data frames refers to the same protein, using the standard UniProt identifier.

modification_predicted in the first data frame indicates whether the modification is predicted to be present at that position:

‘+’ modification predicted to be present
‘-’ modification predicted to be absent

In contrast, the second data frame holds only the positions where the modification is experimentally (truly) present.

My overall aim is to assess the accuracy of the predictions in data frame one against the experimentally verified modifications in data frame two.

There are 5 cases that I have to count separately:

A) the experimental and prediction data frames both contain the same position for the same protein, and the prediction says the position is modified (‘+’ in modification_predicted) - true positive cases
B) the position is present in both data frames for the same protein, but the prediction says there’s no modification (‘-’ in modification_predicted) - false negative cases
C) the prediction says there’s a modification at the position (‘+’ in modification_predicted), but the experimental data frame has no corresponding position for the same protein - false positive cases
D) the prediction says there’s no modification at the position (‘-’ in modification_predicted), and the experimental data frame has no corresponding position for the same protein - true negative cases
E) positions in the experimental data frame that do not correspond to any position for the same protein in the prediction data frame - miscellaneous

I understand that I somehow need to compare each position of each protein in the prediction data frame against each position of the corresponding protein in the experimental data frame.

In pseudo-code, the way I see the solution for this problem is something like this:

TP = 0
FN = 0
TN = 0
FP = 0
Misc = 0

for protein in df1$protein_id:
   for position in protein[from df1]:
      if {condition for TP}:
         TP += 1
      if {condition for FN}:
         FN += 1
      if {condition for TN}:
         TN += 1
      if {condition for FP}:
         FP += 1
      if {condition for misc}:
         Misc += 1

There are two major problems that I face with such a solution.

(1) How do I specify, for each condition, that only positions of the same protein should be compared between the two frames? In other words, how do I restrict the comparison to within-protein positions, without allowing inter-protein comparisons?
(2) The two frames are of unequal length.

Any ideas how to approach these problems?

why are there several pluses? `++` and `+++`. Should they all count as `+` ? — Vladimir Fokow, Aug 31 '22 at 21:03
@VladimirFokow depends on the strength of the prediction, but initially I planned to count all of them as just one plus, i.e. modification present — Tony Zhelonkin, Sep 01 '22 at 05:01

score 0 · Answer 1 · answered Sep 01 '22 at 07:51

You can use merging. Reference: Pandas Merging 101

I assume the index numbers (of both dataframes) are unique. If not, use: df.reset_index()

# Inner merge:
intersection = df_pred.merge(
    df_real, 
    left_on=['protein_id', 'position_predicted'], 
    right_on=['protein_id', 'position']
)

TP = intersection['modification_predicted'].str.contains('+', regex=False).sum()
FN = intersection['modification_predicted'].eq('-').sum()
# FN = len(intersection) - TP  # alternative

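And here select the rows of each dataframe whose (protein, position) pair is not present in the other one. Note that merge() returns a fresh 0..n-1 index, so the original rows cannot be looked up through intersection.index; compare the key pairs directly instead (this needs import pandas as pd):

# Build (protein, position) keys for both frames and keep the unmatched rows
pred_keys = pd.MultiIndex.from_frame(df_pred[['protein_id', 'position_predicted']])
real_keys = pd.MultiIndex.from_frame(df_real[['protein_id', 'position']])
unique_pred = df_pred[~pred_keys.isin(real_keys)]
unique_real = df_real[~real_keys.isin(pred_keys)]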

TN = unique_pred['modification_predicted'].eq('-').sum()
FP = unique_pred['modification_predicted'].str.contains('+', regex=False).sum()
# FP = len(unique_pred) - TN  # alternative

Misc = len(unique_real)

Result:

>>> TP, FN, TN, FP, Misc
(1, 0, 2, 3, 2)
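
An alternative sketch that needs no index bookkeeping: an outer merge with indicator=True labels every row as both, left_only or right_only, and the five counts can be read off that label. It assumes each (protein_id, position) pair appears at most once per dataframe and treats any value containing '+' as "present"; the sample data is rebuilt from the rows shown in the question.

```python
import pandas as pd

# Sample rows from the question (elided rows are not included).
df_pred = pd.DataFrame({
    "protein_id": ["prot1", "prot1", "prot1", "prot2", "prot2", "prot3"],
    "position_predicted": [135, 267, 360, 59, 135, 308],
    "modification_predicted": ["+", "+", "-", "++", "+++", "-"],
})
df_real = pd.DataFrame({
    "protein_id": ["prot1", "prot3", "prot4"],
    "position": [135, 300, 55],
})

# Outer merge keeps unmatched rows from both sides; the _merge column
# records where each row came from.
merged = df_pred.merge(
    df_real,
    how="outer",
    left_on=["protein_id", "position_predicted"],
    right_on=["protein_id", "position"],
    indicator=True,
)

# '+', '++' and '+++' all count as "predicted present".
# na=False makes rows that exist only in df_real evaluate to False.
present = merged["modification_predicted"].str.contains("+", regex=False, na=False)

in_both = merged["_merge"].eq("both")
pred_only = merged["_merge"].eq("left_only")
real_only = merged["_merge"].eq("right_only")

TP = int((in_both & present).sum())     # case A
FN = int((in_both & ~present).sum())    # case B
FP = int((pred_only & present).sum())   # case C
TN = int((pred_only & ~present).sum())  # case D
Misc = int(real_only.sum())             # case E

print(TP, FN, TN, FP, Misc)
```

Because the merge key includes protein_id, positions are only ever matched within the same protein, and the unequal lengths of the two frames do not matter.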
