I have an expression matrix, that is, a matrix containing the expression levels of genes in different human samples. Some of the samples are replicates, so I need to combine the replicates of each sample and calculate the median expression. The sample names are the rows, and each column holds the expression of one gene (I have around 200,000 genes, so ~200,000 columns). The first column looks like this:
Adipocyte - breast, donor1
Adipocyte - breast, donor2
Adipocyte - omental, donor1
Adipocyte - omental, donor2
Adipocyte - omental, donor3
Alveolar Epithelial Cells, donor1
Alveolar Epithelial Cells, donor2
Amniotic Epithelial Cells, donor1
Amniotic Epithelial Cells, donor3
The rest of the columns contain numbers (the expression of the different genes).
So I think I would first need to write a regular expression that matches rows that are identical up to the comma, so that it groups the different donors of the same cell type, and then calculate the median expression of each gene within each group.
Any ideas on how to do this?
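For reference, here is a minimal sketch of the grouping I have in mind, written in plain Python. It assumes the matrix has already been read into (sample name, list of values) pairs; the function name `median_by_cell_type` and the numbers are made up for illustration. Splitting on the last comma may be enough, without a regular expression, as long as the donor label always follows the final comma.

```python
from collections import defaultdict
from statistics import median


def median_by_cell_type(rows):
    """Group replicate rows by cell type and take the per-gene median.

    rows: iterable of (sample_name, [expression values]) pairs.
    Returns a dict mapping cell type -> list of medians, one per gene.
    """
    groups = defaultdict(list)
    for name, values in rows:
        # Everything before the last comma is treated as the cell type;
        # the part after it (e.g. "donor1") is the replicate label.
        cell_type = name.rsplit(",", 1)[0].strip()
        groups[cell_type].append(values)
    # zip(*replicates) transposes the replicate rows, so each tuple
    # holds the values of one gene across all donors of that cell type.
    return {
        cell_type: [median(gene_values) for gene_values in zip(*replicates)]
        for cell_type, replicates in groups.items()
    }


# Made-up expression values for two genes, for illustration only.
rows = [
    ("Adipocyte - breast, donor1", [1.0, 10.0]),
    ("Adipocyte - breast, donor2", [3.0, 20.0]),
    ("Adipocyte - omental, donor1", [5.0, 2.0]),
    ("Adipocyte - omental, donor2", [7.0, 4.0]),
    ("Adipocyte - omental, donor3", [9.0, 0.0]),
]
result = median_by_cell_type(rows)
for cell_type, medians in result.items():
    print(cell_type, medians)
```

With ~200,000 columns this pure-Python loop would probably be slow, so the same idea (derive a cell-type key from the row name, then group and take the median) might be better expressed with a data-frame library.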