
I am an R beginner. I am using R to analyse a large next-generation sequencing VCF file and am having some difficulties. I have imported the very large VCF file as a data frame (2446824 obs. of 177 variables) and made a subset containing just the 3 samples I am interested in (2446824 obs. of 29 variables).

I now wish to reduce the dimensions even further (cut the rows down to around 200000). I have been trying to use `grep`, but cannot get it to work. The error I get is:

Error in "0/1" | "1/0" : 
   operations are possible only for numeric, logical or complex types
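The call that produced this error is not shown, so the following is only a guess at the cause: the message is what R reports when the logical operator `|` is applied directly to two character strings, for example when the alternatives are passed as `"0/1" | "1/0"`. A minimal sketch, using a made-up vector `x`, of the failing form and of the usual alternative (putting the alternation inside a single regular expression):

```r
# Hypothetical example values in the style of the sample columns
x <- c("0/1:11,0:11:33:0,33,381", "./.", "0/0:23,0:23:54:0,54,700")

# `|` between two strings is evaluated before grepl() ever runs and fails,
# because `|` only works on numeric, logical or complex values.
res <- tryCatch(grepl("0/1" | "1/0", x),
                error = function(e) conditionMessage(e))

# Alternation belongs inside the pattern itself: one string, with `|` as regex "or".
hits <- grepl("0/1|1/0|1/1", x)
hits  # TRUE FALSE FALSE
```

`grepl` returns a logical vector (one value per element), which is more convenient for row filtering than the index positions returned by `grep`.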

Here is a small example part of the file I am working with.

Chr Start   End Ref Alt Func.refGene    INFO    FORMAT  Run.Sample1 Run.Sample2 Run.Sample3
489 1   909221  909221  T   C   PASS    GT:AD:DP:GQ:PL  0/1:11,0:11:33:0,33,381     ./.     ./.
490 1   909238  909238  G   C   PASS    GT:AD:DP:GQ:PL  0/1:11,6:17:99:171,0,274    0/1:6,5:11:99:159,0,116     1/1:0,15:15:36:441,36,0
491 1   909242  909242  A   G   PASS    GT:AD:DP:GQ:PL  0/1:16,4:13:45:0,45,532     0/0:11,0:11:30:0,30,366     0/0:16,0:17:39:0,39,479
492 1   909309  909309  T   C   PASS    GT:AD:DP:GQ:PL  0/0:23,0:23:54:0,54,700     0/0:15,1:16:36:0,36,463     0/0:19,0:19:48:0,48,598

There are two different ways in which I would like to reduce the rows in this dataset:

Code 1. If any of $Run.Sample1, $Run.Sample2 or $Run.Sample3 contains “0/1”, “1/0” or “1/1”, keep the entire row.

Code 2. If $Run.Sample1 or $Run.Sample2 contains “0/1”, “1/0” or “1/1”, and $Run.Sample3 contains “0/0”, keep the entire row.

The results I would want to get from code 1 are:

Chr Start   End Ref Alt Func.refGene    INFO    FORMAT  Run.Sample1 Run.Sample2 Run.Sample3
489 1   909221  909221  T   C   PASS    GT:AD:DP:GQ:PL  0/1:11,0:11:33:0,33,381     ./.     ./.
490 1   909238  909238  G   C   PASS    GT:AD:DP:GQ:PL  0/1:11,6:17:99:171,0,274    0/1:6,5:11:99:159,0,116     1/1:0,15:15:36:441,36,0
491 1   909242  909242  A   G   PASS    GT:AD:DP:GQ:PL  0/1:16,4:13:45:0,45,532     0/0:11,0:11:30:0,30,366     0/0:16,0:17:39:0,39,479

The results I would want to get from code 2 are:

Chr Start   End Ref Alt Func.refGene    INFO    FORMAT  Run.Sample1 Run.Sample2 Run.Sample3
489 1   909221  909221  T   C   PASS    GT:AD:DP:GQ:PL  0/1:11,0:11:33:0,33,381     ./.     ./.
491 1   909242  909242  A   G   PASS    GT:AD:DP:GQ:PL  0/1:16,4:13:45:0,45,532     0/0:11,0:11:30:0,30,366     0/0:16,0:17:39:0,39,479

Many thanks for your help

Kelly

Timur Shtatland
  • Welcome, Kelly. This question on [creating reproducible examples in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) may be of use to you in the future. In particular, using `dput` to provide a subset of your data, rather than plaintext, makes it much easier for others to quickly get working on the toy problem. – x4nd3r Sep 01 '14 at 05:47
  • @voidHead - true, but in this case a simple `read.table` around the plaintext gets it done, so it's not as big a concern as it can be. Of more concern is probably the fact that the first 8 columns aren't really necessary at all. – thelatemail Sep 01 '14 at 05:48
  • Your second result seems wrong to me. The first row doesn't have sample 3 contain "0/0" at all. – thelatemail Sep 01 '14 at 05:55
  • Yeah, those results are wrong. I just spent 20 minutes thinking I was wrong – Rich Scriven Sep 01 '14 at 06:56
  • It would be easiest on the brain and eyes to use `sapply(df[,tail(names(df), 3)], substring, 1, 3)` and `%in%` – Rich Scriven Sep 01 '14 at 06:57
  • @thelatemail I am very very sorry for the error. I have realised my error is in the description. Code 2. If $Run.Sample1 or $Run.Sample2 contain either a “0/1” or “1/0” or “1/1” and $Run.Sample3 contain “0/0” **or "./."** keep the entire row – Kelly Williams Sep 02 '14 at 03:24
  • There are many dedicated packages that work with VCF files. – zx8754 Sep 12 '16 at 19:57
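The `substring` and `%in%` idea from the comments could look roughly like the sketch below. It assumes the genotype is always the first three characters of each sample field, and it uses a small made-up data frame (`dat_toy`) rather than the real data. Note that `%in%` drops the matrix shape, so the result is reshaped before taking row sums.

```r
# Hypothetical two-row stand-in for the three sample columns
dat_toy <- data.frame(
  Run.Sample1 = c("0/1:11,0:11", "0/0:23,0:23"),
  Run.Sample2 = c("./.",         "0/0:15,1:16"),
  Run.Sample3 = c("./.",         "0/0:19,0:19"),
  stringsAsFactors = FALSE
)

# Extract the genotype (first 3 characters) from the last three columns
gt <- sapply(dat_toy[, tail(names(dat_toy), 3)], substring, 1, 3)

# %in% returns a plain vector, so restore the rows-by-samples layout
variant <- c("0/1", "1/0", "1/1")
is_variant <- matrix(gt %in% variant, nrow = nrow(dat_toy))

# Keep rows where at least one sample carries a variant genotype
keep <- rowSums(is_variant) > 0
dat_toy[keep, ]
```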

1 Answer


For the first case, try:

  dat[Reduce(`|`,lapply(dat[9:11], function(x) grepl("0/1|1/0|1/1", x))),]
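As a reading aid, the expression above can be unpacked step by step. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `s1`, `s2`, `s3`) instead of the real data: `lapply` yields one logical vector per sample column, and `Reduce` with `|` combines them element-wise into a single row filter.

```r
# Toy stand-in for the three sample columns (hypothetical data, not from the VCF)
toy <- data.frame(
  s1 = c("0/1:11,0", "0/0:23,0", "./."),
  s2 = c("./.",      "0/0:15,1", "1/1:0,9"),
  s3 = c("./.",      "0/0:19,0", "0/0:8,0"),
  stringsAsFactors = FALSE
)

# One logical vector per column: does the value contain a variant genotype?
per_column <- lapply(toy, function(x) grepl("0/1|1/0|1/1", x))

# Combine the vectors element-wise with OR: TRUE if any column matched in that row
keep <- Reduce(`|`, per_column)

toy[keep, ]  # rows 1 and 3 are kept
```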

For the second case based on the conditions mentioned:

 dat[ Reduce(`|`,lapply(dat[9:10], function(x) grepl("0/1|1/0|1/1", x))) 
              & grepl("0/0", dat[,11]),]

Update: to also keep rows where Run.Sample3 is "./." (as clarified in the comments):

 dat[ Reduce(`|`,lapply(dat[9:10], function(x) grepl("0/1|1/0|1/1", x))) 
       & grepl("\\.\\/\\.|0/0", dat[,11]),]
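To check both filters against the example rows from the question, here is a self-contained sketch. Only the three sample columns are reproduced (the filters do not use the others), so the columns are addressed by name rather than by position 9:11. The leading `^` in the Run.Sample3 pattern is an added assumption that the genotype always comes first in the field, as the FORMAT string `GT:AD:DP:GQ:PL` suggests.

```r
# The four example rows, sample columns only, with the original row names
dat <- data.frame(
  Run.Sample1 = c("0/1:11,0:11:33:0,33,381", "0/1:11,6:17:99:171,0,274",
                  "0/1:16,4:13:45:0,45,532", "0/0:23,0:23:54:0,54,700"),
  Run.Sample2 = c("./.", "0/1:6,5:11:99:159,0,116",
                  "0/0:11,0:11:30:0,30,366", "0/0:15,1:16:36:0,36,463"),
  Run.Sample3 = c("./.", "1/1:0,15:15:36:441,36,0",
                  "0/0:16,0:17:39:0,39,479", "0/0:19,0:19:48:0,48,598"),
  row.names = c("489", "490", "491", "492"),
  stringsAsFactors = FALSE
)

variant <- "0/1|1/0|1/1"
in_1_or_2 <- grepl(variant, dat$Run.Sample1) | grepl(variant, dat$Run.Sample2)

# Case 1: a variant genotype in any of the three samples
case1 <- dat[in_1_or_2 | grepl(variant, dat$Run.Sample3), ]

# Case 2 (updated rule): variant in sample 1 or 2, and sample 3 is "0/0" or "./."
case2 <- dat[in_1_or_2 & grepl("^(\\./\\.|0/0)", dat$Run.Sample3), ]

rownames(case1)  # "489" "490" "491"
rownames(case2)  # "489" "491"
```

Both results match the expected output listed in the question, given the corrected rule for the second case.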
akrun