I have a dataset which looks like this:

   uniprot site netphorest
1   C9J0A7  169   0.064921
3   C9J0A7  169   0.063045
4   C9J0A7  169   0.055366
9   C9J0A7  169   0.055366
10  C9J0A7  169   0.055366
11  C9J0A7  169   0.055577
14  C9J0A7  169   0.054875
15  C9J0A7  169   0.054875
16  C9J0A7  169   0.054875
22  C9J0A7  169   0.430742
23  C9J0A7  169   0.430742

There are multiple entries for the same uniprot identifier and modification site, each with its own netphorest score (the likelihood of the site being modified by a particular enzyme), across more than 42,000 observations. Essentially, I want to select the highest score for each uniprot/site combination.

I have tried something like this (CX1h is my data frame):

CX1href <- subset.data.frame(CX1h, netphorest = max)

where I am trying to subset the rows based on the largest value in the netphorest column. However, my new data frame still contains the same number of entries as the original data frame. I am not sure how to approach this, as I have multiple entries with the same uniprot code and site number.

I also tried aggregate and got this error:

CX1href <- aggregate.data.frame(netphorest = ~ uniprot + site, CX1h, FUN = mean, max)
Error in aggregate.data.frame(netphorest = ~uniprot + site, CX1h, FUN = mean,  : 
'by' must be a list

1 Answer

You could for example use:

aggregate(CX1h$netphorest, list(CX1h$uniprot, CX1h$site), max)

(EDIT: as suggested in the comments)
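As a minimal sketch of the same idea with the formula interface, which keeps the original column names instead of Group.1, Group.2 and x: the small data frame below is made up for illustration (the second identifier and all values outside the question's sample are hypothetical). The error in the question most likely comes from calling aggregate.data.frame() directly, so the stray max ends up matched to the by argument; calling the generic aggregate() lets R dispatch on the formula.

```r
# Hypothetical sample mirroring the question's columns; "X00001" is a made-up identifier
CX1h <- data.frame(
  uniprot    = c("C9J0A7", "C9J0A7", "C9J0A7", "X00001", "X00001"),
  site       = c(169, 169, 169, 20, 20),
  netphorest = c(0.064921, 0.430742, 0.055366, 0.2, 0.9)
)

# One row per uniprot/site pair, holding the maximum netphorest score
CX1href <- aggregate(netphorest ~ uniprot + site, data = CX1h, FUN = max)
print(CX1href)
```

The result has one row for each uniprot/site combination, so it should be much shorter than the original 42,000 rows.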

or use a combination of with(), which(), ave() and max() to subset the rows with the maximum netphorest values.
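A rough sketch of that second approach, again on a made-up data frame (the identifier "X00001" and its values are hypothetical). Unlike aggregate(), this keeps the original rows and any other columns they carry; tied maxima are all kept unless duplicates are dropped afterwards.

```r
# Hypothetical sample with a tied maximum, as in the question's data
CX1h <- data.frame(
  uniprot    = c("C9J0A7", "C9J0A7", "C9J0A7", "X00001", "X00001"),
  site       = c(169, 169, 169, 20, 20),
  netphorest = c(0.064921, 0.430742, 0.430742, 0.2, 0.9)
)

# For every row, ave() returns the maximum score within its uniprot/site group
group_max <- with(CX1h, ave(netphorest, uniprot, site, FUN = max))

# Keep only the rows whose score equals the maximum of their group (ties included)
CX1h_ties <- CX1h[which(CX1h$netphorest == group_max), ]

# Optionally keep a single row per uniprot/site pair
CX1href <- CX1h_ties[!duplicated(CX1h_ties[c("uniprot", "site")]), ]
print(CX1href)
```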

Julian Wittische