Merging two massive tables based on common rows

Question

I have two large data tables (or will have them, I still need to get them into the same format) containing genetic SNP data.

These are humongous tables, so anything I do with them I have to do on the cluster.

Both tables have >100,000 rows that contain data for different, but overlapping SNPs. Each column is an individual human (one table has over 900 samples, one has >80). Once the other table is properly formatted, both tables will look like this

dbSNP_RSID  Sample1 Sample2 Sample3 Sample4 Sample5
rs1000001   CC  CC  CC  CC  TC
rs1000002   TC  TT  CC  TT  TT
rs1000003   TG  TG  TT  TG  TG

I want to make a large table with a >1000 columns and that has the intersection of the >100,000 rows represented in both tables. R seems like a good language to use. Anyone have any suggestions on how to do this? Thanks!

I meant 'intersection'. If you want to put both tables together side by side into one larger table use `cbind(table1, table2)`. The number of rows must be exactly the same. And it will look like a data frame, but it will be considered a `matrix` in R parlance. Depending on what you will do next that may be all you need. — Pierre L, Jun 11 '15 at 23:08
Will the rows be in the same order, or will they have to be matched up? — Pierre L, Jun 11 '15 at 23:12
both of the answers will solve it. `merge(table1, table2, by='dbSNP_RSID')` Make sure to put the `by` argument in quotes, I always forget that. — Pierre L, Jun 11 '15 at 23:18

score 5 · Answer 1 · answered Jun 11 '15 at 23:12

5

Use data.table, where DT1 is the first table, DT2 is the second:

library(data.table)
setkey(DT1,"id")
setkey(DT2,"id")
DT <- merge(DT1,DT2,by = "id")

answered Jun 11 '15 at 23:12

Chris

6,302
1
27
54

score 5 · Accepted Answer · answered Jun 11 '15 at 23:16

You could just use merge, for example:

mergedTable <- merge(table1, table2, by = "dbSNP_RSID")

If your samples have overlapping column names, then you'll find that the mergedTable has (for example) columns called Sample1.x and Sample1.y. This can be fixed by renaming the columns before or after the merge.

Reproducible example:

x <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
  matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
    sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
y <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
  matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
    sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
colnames(x)[2:101] <- paste0("Sample", 1:100)
colnames(y)[2:101] <- paste0("Sample", 101:200)
mergedDf <- merge(x, y, by = "dbSNP_RSID")

You've read in your data with no headers. Each column will be named V1, V2, etc. You're then trying to merge using a column called "dbSNP_RSID" which doesn't exist. You need to name the columns appropriately. — Nick Kennedy, Jun 12 '15 at 20:50

Merging two massive tables based on common rows

2 Answers2