create new data.frame based on values in another data.frame

Question

I have a huge data.frame with ~ 3 million rows, and 100 columns. In one of those columns there is information about IDs. I need to create a R script that can be used to produce a new data.frame based on that ID. Basically this new data.frame would only contain the rows where this ID is found and the rest of columns from the big data.frame plus a few extra columns that need to be created based on the info included in the ID column. The final idea is implement this into a shiny app, so the user could type the name ID and then the new data.frame would be visualised.

Here is an example of how my data looks. This would be the big data.frame that I need to split by sample:

Chr     Start   End     Ref     Alt     Callers GATK_Illumina.counts    GATK_Illumina.samples   GATK_SOLiD.counts       GATK_SOLiD.samples      LIFE_SOLiD.counts       LIFE_SOLiD.samples      TVC_Ion.counts  TVC_Ion.samples Func.refGene    
1       14653   14653   C       T       GATK_Illumina   5.38    17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99
1       14677   14677   G       A       GATK_Illumina   2.38    1H321(het)16;6;10;VQSRTrancheSNP99.90to100.00|1K210(het)24;18;6;VQSRTrancheSNP99.90to100.00     0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_ex
1       14815   14815   C       T       GATK_Illumina   1.38    1H321(het)14;12;2;VQSRTrancheSNP99.90to100.00   0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_exonic;downstream WASH7P;DDX11L1  dist=406        
1       14825   14825   G       A       GATK_Illumina   1.38    1H321(het)13;11;2;VQSRTrancheSNP99.90to100.00   0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_exonic;downstream WASH7P;DDX11L1  dist=416        
1       14907   14907   A       G       GATK_Illumina   6.38    17J965(het)57;40;17;VQSRTrancheSNP99.90to100.00|17L166(het)26;15;11;VQSRTrancheSNP99.00to99.90|1H321(het)27;14;13;VQSRTrancheSNP99.90to100.00|1K210(het)42;24;18;VQSRTrancheSNP9
1       14930   14930   A       G       GATK_Illumina   6.38    17J965(het)82;60;22;VQSRTrancheSNP99.90to100.00|17L166(het)38;23;15;VQSRTrancheSNP99.00to99.90|1H321(het)31;17;14;VQSRTrancheSNP99.00to99.90|1K210(het)52;28;24;VQSRTrancheSNP99
1       14933   14933   G       A       GATK_Illumina   2.38    17J965(het)88;76;12;VQSRTrancheSNP99.90to100.00|5G540B(het)77;57;20;VQSRTrancheSNP99.90to100.00 0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_in
1       14948   14948   G       A       GATK_Illumina   1.38    5G540B(het)75;63;12;VQSRTrancheSNP99.90to100.00 0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_intronic;downstream       WASH7P;DDX11L1  dist=539
1       14976   14976   G       A       GATK_Illumina   1.38    5G540B(het)62;50;12;VQSRTrancheSNP99.90to100.00 0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_exonic;downstream WASH7P;DDX11L1  dist=567        
1       15903   15903   -       C       GATK_Illumina   1.38    1K210(hom)2;0;2;VQSRTrancheINDEL99.00to99.90    0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_exonic    WASH7P                          ncRNA_ex
1       16495   16495   G       C       GATK_Illumina   5.38    17L166(het)80;57;23;VQSRTrancheSNP99.90to100.00|1H321(het)48;21;27;VQSRTrancheSNP99.90to100.00|1K210(het)59;38;21;VQSRTrancheSNP99.90to100.00|5G540B(het)95;77;18;VQSRTrancheSNP
1       16497   16497   A       G       GATK_Illumina   5.38    17J965(het)54;37;17;VQSRTrancheSNP99.90to100.00|17L166(het)74;61;13;VQSRTrancheSNP99.90to100.00|1K210(het)48;39;9;VQSRTrancheSNP99.90to100.00|5G540B(het)86;66;20;VQSRTrancheSNP
1       16534   16534   C       T       GATK_Illumina   5.38    17J965(het)19;12;7;VQSRTrancheSNP99.90to100.00|17L166(het)10;4;6;VQSRTrancheSNP99.90to100.00|1K210(het)8;4;4;VQSRTrancheSNP99.90to100.00|5G540B(het)14;7;7;VQSRTrancheSNP99.90to
1       16571   16571   G       A       GATK_Illumina   6.38    17J965(het)55;31;24;VQSRTrancheSNP99.90to100.00|17L166(het)47;16;31;VQSRTrancheSNP99.00to99.90|1H321(het)49;30;19;VQSRTrancheSNP99.90to100.00|1K210(het)52;18;34;VQSRTrancheSNP9
1       16580   16580   C       G       GATK_Illumina   1.38    6K141(het)43;36;7;VQSRTrancheSNP99.90to100.00   0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_intronic;downstream       WASH7P;MIR6859-1;MIR6859

Here is a sample from my original data.frame https://www.dropbox.com/s/jfmv6npiiu8n6zv/big_df.txt?dl=0

And this would be the new data.frame when the user selects the 17J965 ID

Chr     Start   End     Ref     Alt     Callers GATK_Illumina.counts    GATK_Illumina.Zygosity  GATK_Illumina.Depth     GATK_Illumina.RefCount  GATK_Illumina.AltCount  GATK_Illumina.Filter    GATK_SOLiD.counts       GATK_SOLiD.Zygosity     
1       14653   14653   C       T       GATK_Illumina   5.38    het     23      19      4       VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       14907   14907   A       G       GATK_Illumina   6.38    het     57      40      17      VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       14930   14930   A       G       GATK_Illumina   6.38    het     82      60      22      VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       14933   14933   G       A       GATK_Illumina   2.38    het     88      76      12      VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       16497   16497   A       G       GATK_Illumina   5.38    het     54      37      17      VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       16534   16534   C       T       GATK_Illumina   5.38    het     19      12      7       VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       16571   16571   G       A       GATK_Illumina   6.38    het     55      31      24      VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -

Here is the link to the result ID dataframe, https://www.dropbox.com/s/2nfjud7xkb3b6mc/17J965.txt?dl=0

EDIT 1

I have several problems that I don't know how to solve:

1) How to identify the ID? The ID is always linked to the Callers column, i.e, if the caller is GATK_Illumina then the ID will be in the GATK_Illumina.samples, if the Callers column is GATK_Illumina,GATK_SOLID then the the ID can be in two columns GATK_Illumina.samples and GATK_SOLID.samples. This gets more complicated, as you can see from the big data.frame, since from the alphanumeric ID there another values: The format for the sample is always the same an alphanumeric code then a bracket, hom, or het, another bracket and then 3 values separated by a semicolon, a character vector, and then a pipe if there are another IDs info for that row. E.g: 17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99. Different ID info is separated by | and the format is always the same. In this example, there is only one column with ID info since the Caller column only has the value GATK_Illumina but this could get complicated with three different caller values.

2) Once you identified the rows that belong to that ID, how to put the information together? It is just get a grep from the big data.frame and then a rbind? or a subset based on the ID

3) In the child ID dataframe there are a few columns that need to be created based on the ID column and on the Caller column, for example:

 # ID column for row 1 only for value of `Caller` column `GATK_Illumina`:
 17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99

If I want to create the new data.frame for ID 17J965 the new columns to be created (as in the example showed before) would be:

GATK_Illumina.Zygosity, GATK_Illumina.Depth, GATK_Illumina.RefCount, GATK_Illumina.AltCount, GATK_Illumina.Filter, GATK_SOLiD.Zygosity, GATK_SOLiD.Depth, GATK_SOLiD.RefCount, GATK_SOLiD.AltCount, GATK_SOLiD.Filter

An the values after the ID would fill these columns, like this:

GATK_Illumina.Zygosity  GATK_Illumina.Depth     GATK_Illumina.RefCount  GATK_Illumina.AltCount  GATK_Illumina.Filter     GATK_SOLiD.Zygosity GATK_SOLiD.Depth     GATK_SOLiD.RefCount    GATK_SOLiD.AltCount     GATK_SOLiD.Filter 
het     23      19      4       VQSRTrancheSNP99     -     -     -     -     -

Note that the columns are filled according to the Callers column, in this example the Callers column is GATK_Illumina then only the columns created de novo with GATK_Illumina would be filled in, for the rest a - or NA values should be used.

What I have got so far is:

# Let's suppose that I want the ID: 17J965 and the big data.frame is call `big_df.txt`

 big_df <- read.delim("big_df.txt")
 sample <- grep("17J965", test2a$GATK_Illumina.samples)
 df_sample <- big_df[sample,]

 # df_sample has all the rows containing 17J965, but now I want to create the new data.frame with the extra columns and only select the correct values if there are more IDs

How can I extract all the relevant info from the column ID

# I know that I can get the ID using this command 
samples <- sub("\\(.*", "", b)

But what if my ID of interest is on the second pipe?

Thanks

Can you please attach a sample of the data (.csv etc.), or `dput()` the sample data in the question here. — SamFlynn, Dec 21 '18 at 16:07
@SamFlynn, I have added two sample files, one is a small example of the big data.frame and the second is the desired result for just one sample — user2380782, Dec 21 '18 at 16:59
The `alphanumeric code` is the ID, right? Your 1st point isn't quite clear, is the issue understanding what the ID is in the string? or extracting it(using grep, regex etc)? — SamFlynn, Dec 21 '18 at 17:31
Yes, @SamFlynn, the `alphanumeric code` is the ID. I have edited my question to make it more clear and explain a few more scenarios that can complicate the creation of the new data.frame. Thanks a lot for having a look to the data — user2380782, Dec 21 '18 at 19:29

score 0 · Answer 1 · edited Dec 26 '18 at 20:54

Hope you found a solution already, if not this might help.

library(stringr)
library(microbenchmark)
library(tidyr)
library(dplyr)


df <- read.delim("big_df.txt", header=T,sep="\t")
df$GATK_Illumina.samples <- as.character(df$GATK_Illumina.samples)

# Recommend benchmarking to find fast functions for such large dataset
microbenchmark(df[grep("17J965", df$GATK_Illumina.samples),])
microbenchmark(dplyr::filter(df ,grepl("17J965", GATK_Illumina.samples)))

# Subset data
sample <- df[grep("17J965", df$GATK_Illumina.samples),] 

# Split the column on "|" and extract the ID needed
sample$umm <- strsplit(sample$GATK_Illumina.samples,"\\|")
sample$umm <- sapply(sample$umm, function(x){x[grep("17J965",x)]})

# Removing brackets and ID to split to makes things easier to split. 
# Probably should remove brackets from main df instead for each subset 
sample$umm <- sub("17J965|\\(", "",sample$umm)
sample$umm <- sub(")",";",sample$umm)

# Split into columns, filter on the Caller and fill respective columns
sample <- sample %>% dplyr::filter(Callers == "GATK_Illumina") %>% separate(umm,sep=";",into = c("GATK_Illumina.Zygosity", "GATK_Illumina.Depth", "GATK_Illumina.RefCount", "GATK_Illumina.AltCount", "GATK_Illumina.Filter",fill= "left"))

sample <- sample %>% dplyr::filter(Callers == "GATK_SOLiD") %>% separate(umm,sep=";",into = c("GATK_SOLiD.Zygosity", "GATK_SOLiD.Depth", "GATK_SOLiD.RefCount",  "GATK_SOLiD.AltCount", "GATK_SOLiD.Filter",fill= "left"))

A few resources:

Faster way to subset on rows of a data frame in R?

separate()

Filtering row which contains a certain string using dplyr

create new data.frame based on values in another data.frame

1 Answers1