I have a huge data.frame with ~ 3 million rows, and 100 columns. In one of those columns there is information about IDs. I need to create a R script that can be used to produce a new data.frame based on that ID. Basically this new data.frame would only contain the rows where this ID is found and the rest of columns from the big data.frame plus a few extra columns that need to be created based on the info included in the ID column
. The final idea is implement this into a shiny app, so the user could type the name ID and then the new data.frame would be visualised.
Here is an example of how my data looks. This would be the big data.frame that I need to split by sample:
Chr Start End Ref Alt Callers GATK_Illumina.counts GATK_Illumina.samples GATK_SOLiD.counts GATK_SOLiD.samples LIFE_SOLiD.counts LIFE_SOLiD.samples TVC_Ion.counts TVC_Ion.samples Func.refGene
1 14653 14653 C T GATK_Illumina 5.38 17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99
1 14677 14677 G A GATK_Illumina 2.38 1H321(het)16;6;10;VQSRTrancheSNP99.90to100.00|1K210(het)24;18;6;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_ex
1 14815 14815 C T GATK_Illumina 1.38 1H321(het)14;12;2;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_exonic;downstream WASH7P;DDX11L1 dist=406
1 14825 14825 G A GATK_Illumina 1.38 1H321(het)13;11;2;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_exonic;downstream WASH7P;DDX11L1 dist=416
1 14907 14907 A G GATK_Illumina 6.38 17J965(het)57;40;17;VQSRTrancheSNP99.90to100.00|17L166(het)26;15;11;VQSRTrancheSNP99.00to99.90|1H321(het)27;14;13;VQSRTrancheSNP99.90to100.00|1K210(het)42;24;18;VQSRTrancheSNP9
1 14930 14930 A G GATK_Illumina 6.38 17J965(het)82;60;22;VQSRTrancheSNP99.90to100.00|17L166(het)38;23;15;VQSRTrancheSNP99.00to99.90|1H321(het)31;17;14;VQSRTrancheSNP99.00to99.90|1K210(het)52;28;24;VQSRTrancheSNP99
1 14933 14933 G A GATK_Illumina 2.38 17J965(het)88;76;12;VQSRTrancheSNP99.90to100.00|5G540B(het)77;57;20;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_in
1 14948 14948 G A GATK_Illumina 1.38 5G540B(het)75;63;12;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_intronic;downstream WASH7P;DDX11L1 dist=539
1 14976 14976 G A GATK_Illumina 1.38 5G540B(het)62;50;12;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_exonic;downstream WASH7P;DDX11L1 dist=567
1 15903 15903 - C GATK_Illumina 1.38 1K210(hom)2;0;2;VQSRTrancheINDEL99.00to99.90 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_exonic WASH7P ncRNA_ex
1 16495 16495 G C GATK_Illumina 5.38 17L166(het)80;57;23;VQSRTrancheSNP99.90to100.00|1H321(het)48;21;27;VQSRTrancheSNP99.90to100.00|1K210(het)59;38;21;VQSRTrancheSNP99.90to100.00|5G540B(het)95;77;18;VQSRTrancheSNP
1 16497 16497 A G GATK_Illumina 5.38 17J965(het)54;37;17;VQSRTrancheSNP99.90to100.00|17L166(het)74;61;13;VQSRTrancheSNP99.90to100.00|1K210(het)48;39;9;VQSRTrancheSNP99.90to100.00|5G540B(het)86;66;20;VQSRTrancheSNP
1 16534 16534 C T GATK_Illumina 5.38 17J965(het)19;12;7;VQSRTrancheSNP99.90to100.00|17L166(het)10;4;6;VQSRTrancheSNP99.90to100.00|1K210(het)8;4;4;VQSRTrancheSNP99.90to100.00|5G540B(het)14;7;7;VQSRTrancheSNP99.90to
1 16571 16571 G A GATK_Illumina 6.38 17J965(het)55;31;24;VQSRTrancheSNP99.90to100.00|17L166(het)47;16;31;VQSRTrancheSNP99.00to99.90|1H321(het)49;30;19;VQSRTrancheSNP99.90to100.00|1K210(het)52;18;34;VQSRTrancheSNP9
1 16580 16580 C G GATK_Illumina 1.38 6K141(het)43;36;7;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_intronic;downstream WASH7P;MIR6859-1;MIR6859
Here is a sample from my original data.frame https://www.dropbox.com/s/jfmv6npiiu8n6zv/big_df.txt?dl=0
And this would be the new data.frame when the user selects the 17J965
ID
Chr Start End Ref Alt Callers GATK_Illumina.counts GATK_Illumina.Zygosity GATK_Illumina.Depth GATK_Illumina.RefCount GATK_Illumina.AltCount GATK_Illumina.Filter GATK_SOLiD.counts GATK_SOLiD.Zygosity
1 14653 14653 C T GATK_Illumina 5.38 het 23 19 4 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 14907 14907 A G GATK_Illumina 6.38 het 57 40 17 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 14930 14930 A G GATK_Illumina 6.38 het 82 60 22 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 14933 14933 G A GATK_Illumina 2.38 het 88 76 12 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 16497 16497 A G GATK_Illumina 5.38 het 54 37 17 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 16534 16534 C T GATK_Illumina 5.38 het 19 12 7 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 16571 16571 G A GATK_Illumina 6.38 het 55 31 24 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
Here is the link to the result ID dataframe, https://www.dropbox.com/s/2nfjud7xkb3b6mc/17J965.txt?dl=0
EDIT 1
I have several problems that I don't know how to solve:
1) How to identify the ID? The ID is always linked to the Callers
column, i.e, if the caller is GATK_Illumina
then the ID will be in the GATK_Illumina.samples
, if the Callers
column is GATK_Illumina,GATK_SOLID
then the the ID can be in two columns GATK_Illumina.samples
and GATK_SOLID.samples
. This gets more complicated, as you can see from the big data.frame, since from the alphanumeric ID there another values: The format for the sample is always the same an alphanumeric code
then a bracket
, hom
, or het
, another bracket
and then 3 values separated by a semicolon
, a character vector
, and then a pipe
if there are another IDs info for that row. E.g: 17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99
. Different ID info is separated by |
and the format is always the same. In this example, there is only one column with ID info since the Caller
column only has the value GATK_Illumina
but this could get complicated with three different caller values.
2) Once you identified the rows that belong to that ID, how to put the information together? It is just get a grep from the big data.frame and then a rbind? or a subset based on the ID
3) In the child ID dataframe there are a few columns that need to be created based on the ID column and on the Caller
column, for example:
# ID column for row 1 only for value of `Caller` column `GATK_Illumina`:
17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99
If I want to create the new data.frame for ID 17J965
the new columns to be created (as in the example showed before) would be:
GATK_Illumina.Zygosity
, GATK_Illumina.Depth
, GATK_Illumina.RefCount
, GATK_Illumina.AltCount
, GATK_Illumina.Filter
, GATK_SOLiD.Zygosity
, GATK_SOLiD.Depth
, GATK_SOLiD.RefCount
, GATK_SOLiD.AltCount
, GATK_SOLiD.Filter
An the values after the ID would fill these columns, like this:
GATK_Illumina.Zygosity GATK_Illumina.Depth GATK_Illumina.RefCount GATK_Illumina.AltCount GATK_Illumina.Filter GATK_SOLiD.Zygosity GATK_SOLiD.Depth GATK_SOLiD.RefCount GATK_SOLiD.AltCount GATK_SOLiD.Filter
het 23 19 4 VQSRTrancheSNP99 - - - - -
Note that the columns are filled according to the Callers
column, in this example the Callers
column is GATK_Illumina
then only the columns created de novo with GATK_Illumina
would be filled in, for the rest a -
or NA
values should be used.
What I have got so far is:
# Let's suppose that I want the ID: 17J965 and the big data.frame is call `big_df.txt`
big_df <- read.delim("big_df.txt")
sample <- grep("17J965", test2a$GATK_Illumina.samples)
df_sample <- big_df[sample,]
# df_sample has all the rows containing 17J965, but now I want to create the new data.frame with the extra columns and only select the correct values if there are more IDs
How can I extract all the relevant info from the column ID
# I know that I can get the ID using this command
samples <- sub("\\(.*", "", b)
But what if my ID of interest is on the second pipe?
Thanks