Merge on columns with multiple values

Question

I have a data frame, cluster, and one of the columns, cluster$Genes, looks like this:

ENSG00000134684
ENSG00000188846, ENSG00000181163, ENSG00000114391
ENSG00000134684, ENSG00000175390
ENSG00000134684
ENSG00000134684, ENSG00000175390
...

The number of elements in each row in the column is arbitrary. I also have another data frame, expression, that looks like this:

ENSGID           a       b
ENSG00000134684  1       3
ENSG00000175390  2       0
ENSG00000000419  131.23  108.73
ENSG00000000457  7.11    8.68
ENSG00000000460  15.70   6.59
ENSG00000000938  0       0
ENSG00000000971  0.03    0.07
ENSG00000001036  59.22   58.3
...

... and has around 20000 rows. What I want to do, is this:

For all the elements in each row in the cluster$Genes, find the corresponding a and b values
Calculate the min, max and mean values of a and b (separately) for each row in cluster$Genes
Create six new columns in the cluster data frame and fill them with the (min.a, max.a, mean.a, min.b, max.b, mean.b) values

I've tried to find some way of doing this, but it's not going well. While googling for help I thought I might use some kind of apply, and I got some code going. I think it's mostly gibberish and totally nonfunctional, and I'm kind of stuck. This is what I got:

exp.lookup = function(genes) {
  genes.split = strsplit(genes, ', ')
  exp.hct = list()
  exp.hke = list()
  for ( gene in genes.split ) {
    exp.hct = c(exp.hct, merge(gene, means$hct, all.x=TRUE))
    exp.hke = c(exp.hke, merge(gene, means$hke, all.x=TRUE))
    return(c(exp.hct, exp.hke))
  }
}

apply(cluster['Genes'], 1, FUN=exp.lookup)

Anybody got any better ideas, that might actually work?

Please make your example reproducible. See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example on how to do that. You can share data using `dput` function. — Roman Luštrik, Apr 15 '15 at 12:54

Krome · Accepted Answer · 2015-04-16T14:55:41.093

Recreating initial data:

library(data.table)

cluster<- as.data.table(list(Genes = c("ENSG00000134684",
                                       "ENSG00000188846, ENSG00000181163, ENSG00000114391", 
                                       "ENSG00000134684, ENSG00000175390", 
                                       "ENSG00000134684", 
                                       "ENSG00000134684, ENSG00000175390")))

expression<- as.data.table(list(ENSGID = c("ENSG00000134684", "ENSG00000175390",
                                           "ENSG00000000419", "ENSG00000000457",
                                           "ENSG00000000460", "ENSG00000000938",
                                           "ENSG00000000971", "ENSG00000001036"),
                                a = c(1,2,131.23,7.11,15.70, 0, 0.03, 59.22),
                                b = c(3,0,108.73,8.68,6.59,0,0.07,58.3)))
setkey(cluster, Genes)
setkey(expression, ENSGID)

Solution:

library(data.table)

result<- function() {
  colnames<- c("min.a", "max.a", "mean.a", "min.b", "max.b", "mean.b")
  # 1. "(colnames)" is parenthesized to insure we are adding new columns from
  # colnames variable by reference and evaluates to character vector with 
  # new columns names
  # 2. ":=" is for adding new columns to existing data.table by reference
  # 3. "count(Genes)" calls count() function over "Genes" column, but as long
  # as we are using grouping "by = Genes", count() works with each row turn
  # by turn. And each row is a character vector.
  cluster[,(colnames):=count(Genes), by = Genes]
}

# get Genes row
count<- function(charvector) {
  ENSGIDc<- strsplit(charvector, ", ")
  # 4. subsetting "expression" data.table rows by splitted "Genes" character 
  # vector named "ENSGIDc"...
  # 5. ... and then calculating column's maxes, mins and means
  expression[ENSGIDc, .(min(a, na.rm = T), max(a, na.rm = T),
                        mean(a, na.rm = T), min(b, na.rm = T), 
                        max(b, na.rm = T), mean(b, na.rm = T))]
  # 6. at this point we are returning resulting 1 row 6 columns data.table     
  # back to calling function, where it's added to "cluster" data.table
}

suppressWarnings(result())

That works wonderfully, thanks a lot! I don't really understand all the details of the code, though... I don't understand the second lines of either function; could you explain, please? It would be nice to not only have working code, but knowing how it functions as well ;-) — erikfas, Apr 16 '15 at 09:43
Sure! You might want to start with introduction to data.table: https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro-vignette.html to get familiar with base syntax. Data Camp's data.table cheat sheet: http://blog.datacamp.com/data-table-cheat-sheet/ might be of great help as well. And i'll edit my post to add more comments. — Krome, Apr 16 '15 at 10:22
Updated now, feel free to ask questions. I might messed up with terminology, but i hope main idea is still clear. — Krome, Apr 16 '15 at 13:51

zhaoy · Answer 2 · 2015-04-15T13:16:32.740

Assuming each ENSGID corresponds to a unique pair of a and b values, I suggest:

Assign cluster$Genes to a variable (in other words, make a copy of it that is outside of the cluster data-frame). For example, new_cluster_genes <- cluster$Genes
Manipulate new_cluster_genes so that each row has a single ENSGID. Add a column header called ENSGID.
Merge new_cluster_genes with the expression data-frame, using ENSGID as the common ID. Assign the resulting data-frame to a variable: for example, merged_genes.

Calculate the min, max and mean values of a and b (separately) for each row:

library(dplyr) 
merged_genes %>% 
mutate(min.a = min(a),
       max.a = max(a), 
       mean.a = mean(a), 
       min.b = min(b), 
       max.b = max(b), 
       mean.b = mean(b)) -> merged_genes

Create 6 new columns and fill them with the (min.a, max.a, mean.a, min.b, max.b, mean.b) values:
```
merged_genes %>% select(ENSGID, min.a:mean.b) -> merged_genes_subset
```

Manipulate the cluster data-frame so that each row has a single ENSGID. Add a column header called ENSGID. Merge merged_genes_subset with cluster, using ENSGID as the common ID.

Thanks for your reply! I'm not sure if I understand it correctly, but wouldn't doing (2) make me lose the information of which IDs (i.e. ENSGID) are in each row of the `cluster$Genes`? I want the various calculations of `a` and `b` for all the IDs of each row, not separately. For example, in the last row, [min,max,mean](a / b) = [1,2,1.5](a) / [0,3,1.5](b) — erikfas, Apr 15 '15 at 14:05

Merge on columns with multiple values

2 Answers2