I have two large tables of genetic SNP data (or will have them; I still need to get them into the same format).
These are humongous tables, so anything I do with them I have to do on the cluster.
Both tables have >100,000 rows, one row per SNP, and the two sets of SNPs are different but overlapping. Each column is an individual human (one table has over 900 samples, the other has >80). Once the other table is properly formatted, both tables will look like this:
dbSNP_RSID Sample1 Sample2 Sample3 Sample4 Sample5
rs1000001 CC CC CC CC TC
rs1000002 TC TT CC TT TT
rs1000003 TG TG TT TG TG
I want to make one large table with >1000 columns that contains only the intersection of the rows, i.e. the SNPs represented in both tables, with the samples from both tables side by side. R seems like a good language to use. Does anyone have suggestions on how to do this? Thanks!
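
To make the goal concrete: what I am after amounts to an inner join on the dbSNP_RSID column. Below is a minimal sketch of that operation in Python (standard library only), not a tested pipeline. It assumes whitespace-delimited files, rsIDs that are unique within each table, and sample names that do not collide between tables. The second toy table and the function names read_table and intersect_tables are made up for illustration. In R, I assume the equivalent would be something like merge(a, b, by = "dbSNP_RSID").

```python
import io


def read_table(handle):
    """Read a whitespace-delimited genotype table into (header, {rsid: genotypes})."""
    header = handle.readline().split()
    rows = {}
    for line in handle:
        fields = line.split()
        if fields:  # skip blank lines
            rows[fields[0]] = fields[1:]
    return header, rows


def intersect_tables(small_handle, large_handle, out_handle):
    """Write the rows whose SNP ID occurs in both tables, columns side by side.

    The smaller table (>80 samples) is held in memory; the larger one
    (>900 samples) is streamed line by line. Returns the number of shared SNPs.
    """
    small_header, small_rows = read_table(small_handle)
    large_header = large_handle.readline().split()
    # Output header: ID column, large-table samples, then small-table samples.
    out_handle.write("\t".join(large_header + small_header[1:]) + "\n")
    kept = 0
    for line in large_handle:
        fields = line.split()
        if fields and fields[0] in small_rows:
            out_handle.write("\t".join(fields + small_rows[fields[0]]) + "\n")
            kept += 1
    return kept


# Toy data: table_a is the example from the post; table_b is hypothetical.
table_a = """dbSNP_RSID Sample1 Sample2 Sample3 Sample4 Sample5
rs1000001 CC CC CC CC TC
rs1000002 TC TT CC TT TT
rs1000003 TG TG TT TG TG
"""
table_b = """dbSNP_RSID SampleA SampleB
rs1000002 TT TC
rs1000003 TG TT
rs1000004 AA AG
"""

out = io.StringIO()
n_shared = intersect_tables(io.StringIO(table_b), io.StringIO(table_a), out)
print(n_shared)  # rs1000002 and rs1000003 occur in both toy tables
print(out.getvalue())
```

With real files, the io.StringIO objects would be replaced by open file handles. Only the smaller table has to fit in memory, which is why it is passed first.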