How can I subset from a data frame a value in a column that matches criteria from multiple identical entries?

Question

I have a dataset which looks like this:

   uniprot site netphorest
1   C9J0A7  169   0.064921
3   C9J0A7  169   0.063045
4   C9J0A7  169   0.055366
9   C9J0A7  169   0.055366
10  C9J0A7  169   0.055366
11  C9J0A7  169   0.055577
14  C9J0A7  169   0.054875
15  C9J0A7  169   0.054875
16  C9J0A7  169   0.054875
22  C9J0A7  169   0.430742
23  C9J0A7  169   0.430742

There are multiple entries for the same uniprot identifier and modification site, but each uniprot/site combination has multiple netphorest scores (the likelihood of it being modified by a particular enzyme), and there are over 42,000 observations. Essentially, I want to select the highest score for each uniprot/site combination.

I have tried to do something like this (CX1h is my data frame):

CX1href <- subset.data.frame(CX1h, netphorest = max)

where I am trying to subset the rows based on the largest value in the netphorest column. However, my new data frame still contains the same number of entries as the original data frame. I am not sure how to approach this, as I have multiple entries with the same uniprot code and site number.

I tried this out and got this error:

CX1href <- aggregate.data.frame(netphorest = ~ uniprot + site, CX1h, FUN = mean, max)
Error in aggregate.data.frame(netphorest = ~uniprot + site, CX1h, FUN = mean,  : 
'by' must be a list

Maybe try something like `aggregate(netphorest ~ uniprot + site, CX1h, max)` (or your favorite aggregation function). – A5C1D2H2I1M1N2O1R2T1, Sep 09 '15 at 18:38
Or, `library(data.table); setDT(CX1h)[,max(netphorest), by=list(uniprot,site)]` – jlhoward, Sep 09 '15 at 18:45
This question of selecting the max/mean/sum per value of particular column(s) has been asked and answered many, many times on SO. Try searching for it before asking ;) – Colonel Beauvel, Sep 09 '15 at 19:06

Julian Wittische · Accepted Answer · 2015-09-09T19:59:34.267

0

You could for example use:

aggregate(CX1h$netphorest, list(CX1h$uniprot, CX1h$site), max)

(EDIT: as suggested in the comments)

or use a combination of with(), which(), ave() and max() to subset the rows with maximum netphorest values.
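
As a rough sketch of both suggestions, the code below runs them on a small toy data frame. The data are an assumption: the rows loosely follow the question's sample, and site 200 is made up so that there is more than one group. The names `best`, `group_max`, `top_rows` and `top_unique` are illustrative only. Calling the generic `aggregate()` with a formula (rather than `aggregate.data.frame()` directly) lets R dispatch to the formula method, which should avoid the "'by' must be a list" error.

```r
# Toy data modelled on the question; site 200 is invented for illustration
CX1h <- data.frame(
  uniprot    = c("C9J0A7", "C9J0A7", "C9J0A7", "C9J0A7", "C9J0A7"),
  site       = c(169, 169, 169, 200, 200),
  netphorest = c(0.064921, 0.430742, 0.430742, 0.055366, 0.054875)
)

# Option 1: one summary row per uniprot/site pair, with named columns
best <- aggregate(netphorest ~ uniprot + site, data = CX1h, FUN = max)

# Option 2: keep the original rows that hold each group's maximum.
# ave() returns, for every row, the maximum within that row's group;
# drop = TRUE keeps only uniprot/site combinations that actually occur.
group_max <- with(
  CX1h,
  ave(netphorest, interaction(uniprot, site, drop = TRUE), FUN = max)
)

# Rows whose score equals their group's maximum (tied rows are all kept)
top_rows <- CX1h[which(CX1h$netphorest == group_max), ]

# Optionally keep a single row per uniprot/site pair when there are ties
top_unique <- top_rows[!duplicated(top_rows[c("uniprot", "site")]), ]
```

The `aggregate()` route returns only the grouping columns and the maximum, while the `ave()` route keeps every column of the matching rows, which may matter if the real data frame has more columns than the three shown.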

edited Sep 09 '15 at 19:59

answered Sep 09 '15 at 19:53

yes this is what i am after! thanks! – Adam Rabalski Sep 09 '15 at 20:26
