I have an expression matrix, that is, a matrix containing the expression levels of genes in different human samples. Some of the samples are replicates, so I need to combine the expression values of those replicates by calculating their median. The sample names are in rows and each column holds the expression of one gene (I have around 200,000 genes, so ~200,000 columns). The first column looks like this:

Adipocyte - breast, donor1
Adipocyte - breast, donor2
Adipocyte - omental, donor1
Adipocyte - omental, donor2
Adipocyte - omental, donor3
Alveolar Epithelial Cells, donor1
Alveolar Epithelial Cells, donor2
Amniotic Epithelial Cells, donor1
Amniotic Epithelial Cells, donor3

The rest of the columns correspond to numbers (expression of the different genes).

So I think I would first need to write a regular expression that matches the rows that are identical up to the comma, so that it groups the different donors of the same cell type, and then calculate the median expression of each gene within each group.

Any ideas of how to do this?

newa123
  • Are these the row names, with the rest of the matrix numeric? Or is this the first column, in which case your matrix will not be numeric? Or do you have a data frame instead? – rawr Dec 12 '15 at 20:14
  • It is a column, not the rownames. The row.names is another column with the code of each sample. (But I could change that). The rest is numeric – newa123 Dec 13 '15 at 11:30

1 Answer

Here is a less elegant solution (mostly because of the string-splitting function `strsplit`), but it does not require any additional packages and may be easier to follow because it uses only base R syntax (the previous solution uses packages written by Hadley Wickham, I believe, which follow a slightly different grammar).

# Dummy data
dat <- data.frame(tissue = c("Adipocyte - breast, donor1", 
                             "Adipocyte - breast, donor2", 
                             "Adipocyte - omental, donor1", 
                             "Adipocyte - omental, donor2",
                             "Adipocyte - omental, donor3", 
                             "Alveolar Epithelial Cells, donor1",
                             "Alveolar Epithelial Cells, donor2", 
                             "Amniotic Epithelial Cells, donor1",
                             "Amniotic Epithelial Cells, donor3"),
                  val1 = rnorm(9),
                  val2 = rnorm(9),
                  val200000 = rnorm(9))


# Use the "aggregate" function from the default "stats" package,
# grouping rows by the text before the first comma (the cell type)
aggregate(x = dat[2:ncol(dat)],
          by = list(factor(do.call("c", 
                                   lapply(strsplit(x = as.character(dat$tissue), 
                                                   split = ","),
                                          function(a)a[1])))),
          FUN = "median")
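
Because the dummy data above are random, the result is hard to check by eye. As a minimal sketch of the same idea on fixed numbers, the following derives the cell type with `sub` and then takes per-group medians. The names `toy`, `sample`, `geneA`, `geneB` and `cell_type` are made up for illustration and are not part of the original data.

```r
# Hypothetical toy data with known values (names are illustrative only)
toy <- data.frame(
  sample = c("Adipocyte - breast, donor1",
             "Adipocyte - breast, donor2",
             "Adipocyte - omental, donor1",
             "Adipocyte - omental, donor2",
             "Adipocyte - omental, donor3"),
  geneA = c(1, 3, 10, 20, 60),
  geneB = c(2, 2, 5, 7, 6),
  stringsAsFactors = FALSE
)

# Cell type = everything before the first comma
cell_type <- sub(",.*$", "", toy$sample)

# Median of every gene column within each cell type
res <- aggregate(x = toy[-1],
                 by = list(cell_type = cell_type),
                 FUN = median)
print(res)
# "Adipocyte - breast":  geneA 2,  geneB 2
# "Adipocyte - omental": geneA 20, geneB 6
```

The grouping key is built once and passed through `by`, so the same vector could be reused for any number of gene columns.
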
Davit Sargsyan
  • You can simplify your `by` to `list(unlist(strsplit(x = as.character(dat$tissue), split = ",.*$")))` or just `list(gsub('(.*)(,.*)', '\\1', dat$tissue))` – rawr Dec 12 '15 at 20:13