2

This question is very similar to Select the top N values by group but with a variation I can't seem to solve. I have a large frequency table with 3 columns and over 50 rows. Each row is independent and may the same value for a particular column. Example:

       CD     Freq    sample
 express 1    0.2        A
 express 2    0.7        A
 express 2    0.5        B
 express 3    0.1        B
 express 4    0.9        A
 express 4    0.3        C
 express 5    0.8        B
 express 5    0.1        C

I'd like to generate a "top 10" frequency table for each sample where the rows are consolidated by each CD, and the columns are each sample (A, B, C etc), and in the intersection the frequency. Example:

             A    B    C
express 1   0.2   0    0
express 2   0.7  0.5   0
express 3   0    0.1   0
express 4   0.3   0    0.9
express 5   0     0.8  0.1

I've generated the original table and can sort, but I don't know how to aggregate the frequencies and list them in the column by sample. But that's as far as I can wrap my newbie skills around. Any help is greatly appreciated.

data <- vdj.data %>% map_dfr(read_delim, "\t", escape_double = FALSE, trim_ws = TRUE, .id ="source") %>% mutate(source = str_replace(basename(source), ".txt", ""))
data <- vdj.data %>% select(freq, cd, "file"= source)


require(data.table)
d <- data.table(vdj.data, key="cd")

top <- d[, head(.SD, 10), by=cd]

Expected Output should display only the top ten express by frequency in each sample, so it may require that some express may have to be consolidated, for example, say I wanted to pick the top 2 instead of top 10. I would produce the table:

                        A       B       C
            express 2   0.7    0.5     0
            express 4   0.9     0       0.3
            express 5   0       0.8     0.1

So as you can see, express 1 and express 3 are dropped because no sample has it as its top 2 frequencies. The top 2 for A are express 2 and express 4, the top 2 for B are express 2 and express 5, and finally, the top 2 for sample C are expression 4 and 5.

I hope this helps clarify.

PavoDive
  • 6,322
  • 2
  • 29
  • 55
Lou_A
  • 249
  • 1
  • 11
  • Can you show the expected output – akrun Sep 17 '19 at 19:04
  • It's the table at the bottom of the code section. – Lou_A Sep 17 '19 at 19:05
  • Is that the output of `top` – akrun Sep 17 '19 at 19:07
  • Yes. The one with express 1.. on the row and columns with A, B, C.. I just added a clarification to make sure it's not confused with the input table. – Lou_A Sep 17 '19 at 19:16
  • I'm confused by what you mean with "*only the top ten express by frequency in each sample, so it may require that some express may have to be consolidated*". How would the rest be consolidated? Add them? Average them? Drop them? Something else? Your example data only contains 5 different express values, with at most 3 in the same sample, so it doesn't show what you mean. Maybe you could edit your example data so there are 4 or 5 expresses in some samples, and show what the result would be for "top 3" instead of "top 10"? – Gregor Thomas Sep 17 '19 at 19:34
  • The remainder would be dropped. One problem I run into is that the top 10 frequencies for a particular sample may not meet the "top 10" threshold of another, so I need to add that expression in the list of rows. So, in effect, some samples will have expressions with frequency values below the "top 10" for that sample. Putting it another way, each sample must have at least its top 10 frequency values in the list of expressions on the left. – Lou_A Sep 17 '19 at 19:41
  • I edited the original post to show the output desired using only the top 2 frequencies per sample. My original example was too small to include the top 10. Sorry. – Lou_A Sep 17 '19 at 19:55

2 Answers2

2

Using dplyr and reshape2, we get the top N by group and then convert long to wide. Feel free to use your favorite long to wide method from the FAQ

n = 2
df %>% group_by(sample) %>%
  arrange(desc(Freq)) %>% 
  slice(1:n) %>% 
  reshape2::dcast(CD ~ sample, value.var = "Freq", fill = 0)
#          CD   A   B   C
# 1 express 2 0.7 0.5 0.0
# 2 express 4 0.9 0.0 0.3
# 3 express 5 0.0 0.8 0.1

Keeping top n anywhere:

df %>% group_by(sample) %>%
  arrange(desc(Freq)) %>%
  slice(1:n) %>%
  ungroup %>%
  select(CD) %>% 
  inner_join(df) %>%
  reshape2::dcast(CD ~ sample, value.var = "Freq", fill = 0,
                  fun.aggregate = mean)

Using this data:

df = read.table(text = "       CD     Freq    sample
 'express 1'  0.2        A
 'express 2'    0.7        A
 'express 2'  0.5        B
 'express 3'    0.1        B
 'express 4'  0.9        A
 'express 4'    0.3        C
 'express 5'  0.8        B
 'express 5'    0.1        C", header = T)
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • I'll give it a try. – Lou_A Sep 17 '19 at 20:05
  • Gregor, it seems to work. I have a question. If I wanted to expanded to say the top 3 (or another value), that's what "slice" indicates? So if I wanted the top 3 instead, I would write slice(1:3)? – Lou_A Sep 17 '19 at 20:14
  • Yeah - after the initial post I parametrized `n`, it's now defined in the first line. But your assumption is correct. – Gregor Thomas Sep 17 '19 at 20:15
  • Another comment. I noticed that output generates only 2 frequencies for each sample (I imagine these are the top 2), but I think some of these samples have express frequencies that although may not be in the top 2, the frequency still should be displayed in that sample's column if the express is in the top 2 of another sample. Any thoughts how to accomplish this? – Lou_A Sep 17 '19 at 20:19
  • @Gregor okay, got it. Thanks – akrun Sep 17 '19 at 20:24
  • Your solution works well for whatever top value I select. I'm just trying to anticipate a question that I may be asked regarding the output table. – Lou_A Sep 17 '19 at 20:25
  • Added a version for that. – Gregor Thomas Sep 17 '19 at 20:27
  • After I used your last version, for some reason, now I'm only getting for frequency values 0, 1, or 2. The only additional piece of code is the inner_join. I wonder what's it doing. – Lou_A Sep 17 '19 at 20:40
  • 1
    Awesome! That worked like a charm. Thanks Gregor. – Lou_A Sep 17 '19 at 20:50
2

Since you included the library, here's a similar route:

library(data.table)

setorder(dt, -Freq)

n = 2
dt[dt[, .I[1:n], by = sample]$V1,
   xtabs(Freq ~ CD + sample)]
#>           sample
#> CD           A   B   C
#>   express2 0.7 0.5 0.0
#>   express4 0.9 0.0 0.3
#>   express5 0.0 0.8 0.1

#or

dcast(data = dt[dt[, .I[1:n], by = sample]$V1],
      CD ~ sample,
      value.var = 'Freq',
      fill = 0L)
#>          CD   A   B   C
#> 1: express2 0.7 0.5 0.0
#> 2: express4 0.9 0.0 0.3
#> 3: express5 0.0 0.8 0.1

Created on 2019-09-18 by the reprex package (v0.3.0)

Data:

dt <- fread(
'CD     Freq    sample
express1    0.2        A
express2    0.7        A
express2    0.5        B
express3    0.1        B
express4    0.9        A
express4    0.3        C
express5    0.8        B
express5    0.1        C')
Cole
  • 11,130
  • 1
  • 9
  • 24