
I have a 38467 × 59 data frame. Here is a subset of the whole dataset:

mydata.2
   Unique.Groups Protein.group.IDs Intensity.R1 Intensity.R2
1            yes                 0      9701600     17971000
2            yes                 0            0     29308000
3            yes                 1     49083000     75300000
4            yes                 2     53926000     13926000
5            yes                 3     35255000     35750000
6            yes                 3    302600000    295410000
7            yes                 4     72766000     51914000
8            yes                 5     33657000     31027000
9            yes                 5     16504000     20689000
10           yes                 5    191170000    209230000
11           yes                 5    155710000    150280000
12           yes                 5    402890000    529300000
13           yes                 5    633480000    522180000
14           yes                 5     92339000            0
15           yes                 6            0     30646000
16           yes                 6     67494000     92326000
17           yes                 6     20205000            0
18           yes                 7     94183000     76848000
19           yes                 7     15584000            0
20           yes                 7     27247000     38961000
21           yes                 7     98433000    390870000
22           yes                 8            0     36542000
23           yes                 8     83978000            0
24           yes                 8     53744000     95173000
25           yes                 8            0     53494000

I applied the following piece of code to the Intensity.R1 column and I'd like to apply it automatically to all the intensity columns of the data frame. On each round, the same code should run on Intensity.R1, then Intensity.R2, and so on.

data.normalized <- filter(mydata.2, Unique.Groups == "yes") %>% 
  group_by(Protein.group.IDs) %>%
  arrange(desc(Intensity.R1)) %>% 
  top_n(3, Intensity.R1) %>%
  mutate(sum.top.3.R1 = sum(Intensity.R1)) %>%
  filter(!duplicated(Protein.group.IDs)) %>%
  drop.levels() %>%
  arrange(Protein.group.IDs)
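To illustrate the repetition on a small scale, here is a toy sketch in base R (hypothetical data; my real data frame has many more Intensity.R* columns, and this variant adds `na.rm = TRUE` so groups with fewer than 3 rows still return a sum):

```r
toy <- data.frame(Unique.Groups     = rep("yes", 6),
                  Protein.group.IDs = c(5, 5, 5, 5, 6, 6),
                  Intensity.R1      = c(1, 2, 3, 4, 10, 20),
                  Intensity.R2      = c(4, 3, 2, 1, 20, 10))

# sum of the top 3 values; NA-safe so groups with < 3 rows still work
sum.top.3 <- function(x) sum(sort(x, decreasing = TRUE)[1:3], na.rm = TRUE)

int.cols <- grep("^Intensity\\.R", names(toy), value = TRUE)
# one tapply per intensity column; rows are the Protein.group.IDs
sums <- sapply(int.cols, function(col)
  tapply(toy[[col]], toy$Protein.group.IDs, sum.top.3))
colnames(sums) <- sub("^Intensity", "sum.top.3", colnames(sums))
sums
```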

Each round should generate a new column (e.g. sum.top.3.R1) that I want to bind into a final table. Here is an example for the Intensity.R1 column only.

data.normalized    
Source: local data frame [9 x 5]
    Groups: Protein.group.IDs

      Unique.Groups Protein.group.IDs Intensity.R1 Intensity.R2 sum.top.3.R1
    1           yes                 0      9701600     17971000      9701600
    2           yes                 1     49083000     75300000     49083000
    3           yes                 2     53926000     13926000     53926000
    4           yes                 3    302600000    295410000    337855000
    5           yes                 4     72766000     51914000     72766000
    6           yes                 5    633480000    522180000   1227540000
    7           yes                 6     67494000     92326000     87699000
    8           yes                 7     98433000    390870000    219863000
    9           yes                 8     83978000            0    137722000
    Can you please make your example reproducible while also providing the desired output (that will fit the provided data set *exactly*)? See [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Also, please keep it small as we don't really need a `38467*59` matrix in order to offer a solution. Finally, I would suggest not to use `dplyr` on matrices. `dplyr` is designed to work on `data.frame`s, if you have a big matrix and you are looking for efficiency, you should think vectorization/matrix manipulations. – David Arenburg Aug 09 '15 at 08:10
  • Hi David, thanks for your comment. I actually made a mistake: my data is a data.frame object. I edited my original post; hope it will match your requirements (this is my first post, actually...) – Raphaël Bilgraer Aug 09 '15 at 08:27
  • Does your desired output match the provided example, or are these two just `head` calls? Also, please describe in words what you are actually trying to achieve and what is wrong with the already existing code – David Arenburg Aug 09 '15 at 08:28
  • These two are just head calls. But the piece of code I provided works and gives me what I want. I just want to repeat it over several columns of my data frame and bind the resulting columns into the same table. In other words, I'm wondering if a for loop would work here, or any apply() function... – Raphaël Bilgraer Aug 09 '15 at 08:34
  • So you basically didn't follow the instructions. We need a fully reproducible example and an *exactly* matching desired output. We need a word description of the problem, then your attempt and an explanation of what doesn't work. If, for example, you want to create 50 additional columns, you could reduce it to 3 for that matter, and then you should be able to expand a possible solution to your real data. Otherwise, good luck with this question and I hope someone else will help you. – David Arenburg Aug 09 '15 at 08:37
  • Sorry for misunderstanding the instructions. Hope this is fixed now. Thanks for your time. – Raphaël Bilgraer Aug 09 '15 at 08:52
  • That is better, but I still don't understand the logic. How were these three columns created? Are you selecting maximum values in `Intensity.R1` and `Intensity.R2` and then summing the top 3 in `Intensity.R1` per group? In other words, you haven't described in words what you are trying to achieve. For example, are you trying to get the sum of the top 3 values in all `Intensity.R` columns? – David Arenburg Aug 09 '15 at 09:19
  • Try this for example and tell me if this is doing what you need (you may need to play around with column names afterwards): `library(data.table); res <- setDT(mydata.2)[Unique.Groups == "yes", c(lapply(.SD, function(x) x[which.max(x)]), lapply(.SD, function(x) sum(sort(x, decreasing = TRUE)[1:3], na.rm = TRUE))), .SDcols = grep("Intensity.R", names(mydata.2)), by = Protein.group.IDs]` – David Arenburg Aug 09 '15 at 09:32
  • I got the following error message: Error in assign("..FUN", eval(fun, SDenv, SDenv), SDenv) : cannot change value of locked binding for '..FUN' – Raphaël Bilgraer Aug 09 '15 at 10:03
  • What `data.table` version you have? – David Arenburg Aug 09 '15 at 10:05
  • I have the 1.9.4 version – Raphaël Bilgraer Aug 09 '15 at 10:09
  • Yes, it seems to only work on the [development version on GH (v 1.9.5+)](https://github.com/Rdatatable/data.table/wiki/Installation). Either way, it seems like you accepted an answer which returns only half of your desired output, so you could just do `res <- setDT(mydata.2)[Unique.Groups == "yes", lapply(.SD, function(x) sum(sort(x, decreasing = TRUE)[1:3], na.rm = TRUE)), .SDcols = grep("Intensity.R", names(mydata.2)), by = Protein.group.IDs]` in order to get the same result, and it will work with v 1.9.4 too. – David Arenburg Aug 09 '15 at 10:15
  • Yes it's also working very well. Actually it replaces the Intensity columns instead of creating a new column, but the intensity column is kind of useless after the group_by() function. Thanks again for your help! – Raphaël Bilgraer Aug 09 '15 at 10:19
  • It's not replacing them; it's just saving them in a new data set `res` with the same column names. I've added `res <- ` to the comment. Pretty much the same as the `dplyr` solution does. Anyway, glad you found your solution. – David Arenburg Aug 09 '15 at 10:20

1 Answer


I've created a simple example with the same column names for you to modify.

library(dplyr)
library(gdata)

mydata.2 = data.frame(Unique.Groups = rep("yes",8),
                      Protein.group.IDs = c(1,1,1,1,2,2,2,2),
                      Intensity.R1 = c(10,40,20,30,40,50,70,60),
                      Intensity.R2 = c(70,40,50,60,40,10,20,30))

# function that orders a vector and keeps top 3 values and calculates the sum
ff = function(x) {x[order(-x)][1:3] %>% sum()}


filter(mydata.2, Unique.Groups == "yes") %>% 
  group_by(Protein.group.IDs) %>%
  select(-Unique.Groups) %>% # keep only grouping variable and the ones you need to calculate
  summarise_each(funs(ff))

Even in your own code, if you had used "summarise" instead of "mutate", you wouldn't have had to filter out the duplicated IDs. I don't know if I'm missing something with the drop.levels here. Also, I don't think you really needed the first "arrange", as top_n will keep the top 3 anyway, just not in the right order (which is not a problem here).
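If you'd rather not depend on dplyr, the same per-group top-3 sums can be reproduced with base R's aggregate (a sketch on the same example data; this variant adds `na.rm = TRUE`, and aggregate keeps the original column names, so rename them to sum.top.3.R* if you want):

```r
mydata.2 <- data.frame(Unique.Groups = rep("yes", 8),
                       Protein.group.IDs = c(1, 1, 1, 1, 2, 2, 2, 2),
                       Intensity.R1 = c(10, 40, 20, 30, 40, 50, 70, 60),
                       Intensity.R2 = c(70, 40, 50, 60, 40, 10, 20, 30))

# sum of the top 3 values, NA-safe for groups with fewer than 3 rows
ff <- function(x) sum(sort(x, decreasing = TRUE)[1:3], na.rm = TRUE)

keep <- mydata.2$Unique.Groups == "yes"
res <- aggregate(mydata.2[keep, c("Intensity.R1", "Intensity.R2")],
                 by = list(Protein.group.IDs = mydata.2$Protein.group.IDs[keep]),
                 FUN = ff)
res
#   Protein.group.IDs Intensity.R1 Intensity.R2
# 1                 1           90          180
# 2                 2          180           90
```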

AntoniosK