I have two Pandas data frames of unequal length. The first contains data about predicted protein modifications; the second holds data for experimentally verified protein modifications.

The first data frame contains the following columns:

  • protein_id
  • position_predicted
  • modification_predicted

… and looks something like this:

protein_id position_predicted modification_predicted
prot1 135 +
prot1 267 +
prot1 360 -
prot2 59 ++
prot2 135 +++
prot3 308 -

The second data frame contains columns with experimentally verified protein modification positions:

  • protein_id
  • position

… and looks like so:

protein_id position
prot1 135
prot3 300
prot4 55
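
To make the example reproducible, the two frames can be built as in the sketch below. The variable names df_pred and df_real are assumptions (chosen to match the names used in the answer), and the second frame uses the column name position, as in the table above.

```python
import pandas as pd

# Predictions: one row per (protein, position) with a predicted sign
df_pred = pd.DataFrame({
    'protein_id': ['prot1', 'prot1', 'prot1', 'prot2', 'prot2', 'prot3'],
    'position_predicted': [135, 267, 360, 59, 135, 308],
    'modification_predicted': ['+', '+', '-', '++', '+++', '-'],
})

# Experimentally verified positions: every row is a true modification
df_real = pd.DataFrame({
    'protein_id': ['prot1', 'prot3', 'prot4'],
    'position': [135, 300, 55],
})

print(df_pred.shape, df_real.shape)
```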

protein_id in both data frames refers to the same protein, using the standard UniProt identifier.

modification_predicted in the first data frame indicates whether the modification is predicted to be present at the position:

  • ‘+’ modification predicted to be present
  • ‘-’ modification predicted to be absent

In contrast, the second data frame lists only the positions where the modification is experimentally (truly) present.

Now my global aim is to somehow compare the accuracy of the predictions from data frame one with the experimentally verified modifications from data frame two.

There are 5 cases that I have to count separately:

  • A) the experimental data frame and the prediction data frame both have the same position for the same protein, and the prediction says the position is modified (‘+’ in modification_predicted) - true positive cases

  • B) the position is the same in both data frames for the same protein, but the prediction says there is no modification (‘-’ in modification_predicted) - false negative cases

  • C) the prediction says there is a modification at the position (‘+’ in modification_predicted), but the experimental data frame has no corresponding position for the same protein - false positive cases

  • D) the prediction says there is no modification at the position (‘-’ in modification_predicted), and the experimental data frame has no corresponding position for the same protein - true negative cases

  • E) positions in the experimental data frame that do not correspond to any position for the same protein in the prediction data frame - miscellaneous
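
Cases A to D depend on only two facts about a prediction row: its sign, and whether the same (protein, position) pair exists in the experimental data. A minimal sketch of that decision, where the function name classify and its parameters are hypothetical, and where any value containing ‘+’ (such as ‘++’ or ‘+++’) is assumed to count as a positive prediction:

```python
def classify(modification_predicted: str, experimentally_verified: bool) -> str:
    """Label one prediction row as TP, FN, FP or TN (cases A to D)."""
    # Assumption: '+', '++' and '+++' all mean "predicted present"
    predicted_present = '+' in modification_predicted
    if experimentally_verified:
        return 'TP' if predicted_present else 'FN'
    return 'FP' if predicted_present else 'TN'


print(classify('+', True), classify('-', True),
      classify('++', False), classify('-', False))
```

Case E is not covered by this function, because it concerns experimental rows that have no prediction row at all.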

I understand that I need to somehow iterate over each position of each protein in the prediction data frame and compare it with each position of the corresponding protein in the experimental data frame.

In pseudo-code, the way I see the solution to this problem is something like this:

TP = 0
FN = 0
TN = 0
FP = 0
Misc = 0

for protein in df1['protein_id']:
   for position in protein[from df1]:
      if {condition for TP}:
         TP += 1
      if {condition for FN}:
         FN += 1
      if {condition for TN}:
         TN += 1
      if {condition for FP}:
         FP += 1
      if {condition for misc}:
         Misc += 1

There are two major problems that I face with such a solution.

(1) How do I specify for each condition that only positions of the same protein should be compared between the two frames? In other words, how do I restrict the comparison to within-protein positions, without allowing inter-protein comparisons?
(2) The two frames have unequal lengths.

Any ideas how to approach these problems?

1 Answer

You can use merging. Reference: Pandas Merging 101

I assume each (protein_id, position) pair occurs at most once in each dataframe. Duplicated pairs in df_real would produce extra rows in the merge and inflate the counts.

# Inner merge:
intersection = df_pred.merge(
    df_real, 
    left_on=['protein_id', 'position_predicted'], 
    right_on=['protein_id', 'position']
)

TP = intersection['modification_predicted'].str.contains('+', regex=False).sum()
FN = intersection['modification_predicted'].eq('-').sum()
# FN = len(intersection) - TP  # alternative
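
As a quick check of problem (1), the sketch below rebuilds the example frames from the question and runs the same inner merge. How the frames are constructed is an assumption; the names df_pred and df_real follow the code above. Because the merge keys include protein_id, prot2 at position 135 is not paired with prot1 at position 135. The unequal lengths in problem (2) do not matter either, since rows are aligned by key values and not by row order.

```python
import pandas as pd

df_pred = pd.DataFrame({
    'protein_id': ['prot1', 'prot1', 'prot1', 'prot2', 'prot2', 'prot3'],
    'position_predicted': [135, 267, 360, 59, 135, 308],
    'modification_predicted': ['+', '+', '-', '++', '+++', '-'],
})
df_real = pd.DataFrame({
    'protein_id': ['prot1', 'prot3', 'prot4'],
    'position': [135, 300, 55],
})

# Match rows only when BOTH the protein and the position agree
intersection = df_pred.merge(
    df_real,
    left_on=['protein_id', 'position_predicted'],
    right_on=['protein_id', 'position'],
)

# Show which (protein, position) pairs occur in both frames
print(intersection[['protein_id', 'position_predicted', 'modification_predicted']])
```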

And here select elements of both dataframes which are not present in the other one:

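# Outer merge; the '_merge' indicator column marks rows found in only one frame
merged = df_pred.merge(df_real, how='outer', indicator=True,
                       left_on=['protein_id', 'position_predicted'],
                       right_on=['protein_id', 'position'])
unique_pred = merged[merged['_merge'] == 'left_only']
unique_real = merged[merged['_merge'] == 'right_only']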
TN = unique_pred['modification_predicted'].eq('-').sum()
FP = unique_pred['modification_predicted'].str.contains('+', regex=False).sum()
# FP = len(unique_pred) - TN  # alternative

Misc = len(unique_real)

Result:

>>> TP, FN, TN, FP, Misc
(1, 0, 2, 3, 2)
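
Since the stated aim is to judge the accuracy of the predictions, the counts can be turned into summary metrics. A minimal sketch using the counts from the result above; leaving Misc out of the metrics is an assumption, because those experimental positions have no prediction to score.

```python
# Counts taken from the result above
TP, FN, TN, FP = 1, 0, 2, 3

total = TP + FN + TN + FP
accuracy = (TP + TN) / total
# Guard against empty denominators on small data sets
precision = TP / (TP + FP) if (TP + FP) else float('nan')
recall = TP / (TP + FN) if (TP + FN) else float('nan')

print(f"accuracy={accuracy}, precision={precision}, recall={recall}")
```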
Vladimir Fokow