I have five data.frames with gene expression data for different sets of samples. Each data.frame has a different number of rows, so the row names (genes) only partly overlap.

Now I want a) to filter the five data.frames to contain only genes that are present in all data.frames and b) to combine the gene expression data for those genes to one data.frame.

All I could find so far was merge, but that can only merge two data.frames, so I'd have to use it multiple times. Is there an easier way?

Lilith-Elina

2 Answers

Merging is not very efficient if you want to exclude row names which are not present in every data frame. Here's a different proposal.

First, three example data frames:

df1 <- data.frame(a = 1:5, b = 1:5, 
                  row.names = letters[1:5]) # letters a to e
df2 <- data.frame(a = 1:5, b = 1:5, 
                  row.names = letters[3:7]) # letters c to g
df3 <- data.frame(a = 1:5, b = 1:5, 
                  row.names = letters[c(1,2,3,5,7)]) # letters a, b, c, e, and g
# row names present in all three data frames: c and e

Put the data frames into a list:

dfList <- list(df1, df2, df3)

Find common row names:

idx <- Reduce(intersect, lapply(dfList, rownames))
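
To make the folding step concrete: Reduce applies intersect pairwise from left to right, so for three data frames it should give the same result as nesting two intersect calls by hand. A minimal sketch using the example data above (idx_nested is a name introduced here only for comparison):

```r
# Example data frames from the answer above
df1 <- data.frame(a = 1:5, b = 1:5, row.names = letters[1:5])
df2 <- data.frame(a = 1:5, b = 1:5, row.names = letters[3:7])
df3 <- data.frame(a = 1:5, b = 1:5, row.names = letters[c(1, 2, 3, 5, 7)])
dfList <- list(df1, df2, df3)

# Fold intersect over the list of row-name vectors
idx <- Reduce(intersect, lapply(dfList, rownames))

# The same computation written out by hand for three data frames
idx_nested <- intersect(intersect(rownames(df1), rownames(df2)), rownames(df3))

print(idx)
print(identical(idx, idx_nested))
```

The Reduce form scales to any number of data frames in the list without further nesting.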

Extract data:

df1[idx, ]

  a b
c 3 3
e 5 5

PS. If you want to keep the corresponding rows from all data frames, you could replace the last step, df1[idx, ], with the following command:

do.call(rbind, lapply(dfList, "[", idx, ))
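
Note that rbind stacks the matching rows on top of each other. If the goal is instead part b) of the question, one row per gene with the expression columns of all sample sets side by side, a column-wise bind may be closer to what is wanted. A minimal sketch, assuming the list elements are given labels (s1, s2, s3 are hypothetical names, not from the question) to help tell the columns apart:

```r
# Example data frames from the answer above
df1 <- data.frame(a = 1:5, b = 1:5, row.names = letters[1:5])
df2 <- data.frame(a = 1:5, b = 1:5, row.names = letters[3:7])
df3 <- data.frame(a = 1:5, b = 1:5, row.names = letters[c(1, 2, 3, 5, 7)])

# Hypothetical labels for the sample sets
dfList <- list(s1 = df1, s2 = df2, s3 = df3)

# Row names shared by every data frame
idx <- Reduce(intersect, lapply(dfList, rownames))

# Subset each data frame to the shared rows (same order in each),
# then bind the columns side by side
combined <- do.call(cbind, lapply(dfList, function(d) d[idx, , drop = FALSE]))

print(combined)
```

Because every data frame is indexed with the same idx vector, the rows line up by gene before the columns are bound.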
Sven Hohenstein

Check out the top answer in this SO post. Put your data frames into a list and apply the following line of code:

Reduce(function(...) merge(..., by = "x"), list.of.dataframes)

You just have to adjust the by argument to specify by which common column the data frames should be merged.

fdetsch
  • I'm afraid it's not that easy. Merge creates a new column with the row names and when trying to add the next data.frame, complains about that. `1: In merge.data.frame(..., by = 0) : column name ‘Row.names’ is duplicated in the result 2: In merge.data.frame(..., by = 0) : column names ‘Row.names’, ‘Row.names’ are duplicated in the result 3: In merge.data.frame(..., by = 0) : column names ‘Row.names’, ‘Row.names’, ‘Row.names’ are duplicated in the result` – Lilith-Elina May 29 '13 at 08:27
  • How do you import your data? Maybe you could just skip setting `row.names` which would result in an additional column containing these previous rownames. Guess it should work then! – fdetsch May 29 '13 at 08:31
  • Nice try, but no. `Warning messages: 1: In merge.data.frame(..., by = 0) : column names ‘Row.names.x’, ‘Row.names.y’ are duplicated in the result 2: In merge.data.frame(..., by = 0) : column names ‘Row.names.x’, ‘Row.names.x’, ‘Row.names.y’, ‘Row.names.y’ are duplicated in the result 3: In merge.data.frame(..., by = 0) : column names ‘Row.names.x’, ‘Row.names.x’, ‘Row.names.x’, ‘Row.names.y’, ‘Row.names.y’, ‘Row.names.y’ are duplicated in the result` – Lilith-Elina May 29 '13 at 09:03
  • Ah, just tested it. Seems like applying `Reduce` in combination with `merge` reveals some serious weaknesses when trying to merge more than two dataframes. Sorry, shit happens ;-) – fdetsch May 29 '13 at 10:10
  • No problem, Sven Hohenstein had a great suggestion that works perfectly. And we learned something new. ;-) – Lilith-Elina May 29 '13 at 11:00
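
The warnings quoted in the comments appear to come from merge(..., by = 0) adding a Row.names column to its result on every pass. A minimal sketch of one possible way to keep the Reduce/merge approach, assuming it is acceptable to move the row names into an ordinary key column (called gene here, a hypothetical name) and to prefix each data frame's columns with a hypothetical sample-set label (s1, s2, s3) so that no column names collide:

```r
# Example data frames from the first answer
df1 <- data.frame(a = 1:5, b = 1:5, row.names = letters[1:5])
df2 <- data.frame(a = 1:5, b = 1:5, row.names = letters[3:7])
df3 <- data.frame(a = 1:5, b = 1:5, row.names = letters[c(1, 2, 3, 5, 7)])
dfList <- list(s1 = df1, s2 = df2, s3 = df3)

# Give every data frame unique column names and an explicit key column
keyed <- lapply(names(dfList), function(nm) {
  d <- dfList[[nm]]
  names(d) <- paste(nm, names(d), sep = ".")
  data.frame(gene = rownames(d), d, row.names = NULL, stringsAsFactors = FALSE)
})

# merge() keeps only keys present in both inputs by default,
# so folding it over the list leaves the genes common to all
merged <- Reduce(function(x, y) merge(x, y, by = "gene"), keyed)

print(merged)
```

This should select the same genes as the intersect approach in the first answer, at the cost of repeated merges.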