9

I have a VCF file that I want to read in R. However, the file contains some redundant lines that I want to skip. I want the result to start at the row matching #CHROM, as shown below.

This is what I have tried:

## read the first lines of the file to find where the VCF header row starts
chromo1 <- try(scan("myfile.vcf", what = character(), n = 5000, sep = "\n", skip = 0, fill = TRUE, na.strings = "", quote = "\""))
skip.lines <- grep("^#CHROM", chromo1)

## read only the header row to get the column labels
column.labels <- read.delim("myfile.vcf", header = FALSE, nrows = 1, skip = (skip.lines - 1), sep = "\t", fill = TRUE, stringsAsFactors = FALSE, na.strings = "", quote = "\"")
num.vars <- dim(column.labels)[2]

myfile.vcf

    #not wanted line
    #unnecessary line
    #junk line
    #CHROM  POS     ID      REF     ALT
    11      33443   3        A       T
    12      33445   5        A       G

result

    #CHROM  POS     ID      REF     ALT
    11      33443   3        A       T
    12      33445   5        A       G
Timur Shtatland
MAPK

3 Answers

7

This may work for you:

# read the VCF file twice: once as raw lines for the column names, once as a table for the data
# (read.table skips lines starting with "#" by default, because comment.char = "#")
tmp_vcf <- readLines("test.vcf")
tmp_vcf_data <- read.table("test.vcf", stringsAsFactors = FALSE)

# keep only the header line that starts with #CHROM and split it into column names
tmp_vcf <- grep("^#CHROM", tmp_vcf, value = TRUE)[1]
vcf_names <- unlist(strsplit(tmp_vcf, "\t"))
names(tmp_vcf_data) <- vcf_names

P.S. If you have several VCF files, you can apply the same steps to each of them with the lapply function.
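The lapply suggestion could look roughly like the sketch below. It is only an illustration: `read_vcf_table` and `make_demo_vcf` are hypothetical helper names that are not part of this answer, and the demo files are written to temporary paths using the sample rows from the question, assuming tab-separated columns.

```r
# Hypothetical helper (not from the answer): the two-pass approach wrapped in a function.
read_vcf_table <- function(path) {
  lines <- readLines(path)
  # the header row is the first line starting with #CHROM
  header_line <- grep("^#CHROM", lines, value = TRUE)[1]
  vcf_names <- unlist(strsplit(header_line, "\t"))
  # read.table drops lines starting with "#" by default (comment.char = "#")
  vcf_data <- read.table(path, sep = "\t", stringsAsFactors = FALSE)
  names(vcf_data) <- vcf_names
  vcf_data
}

# Hypothetical demo data: writes the sample rows from the question to a temporary file.
make_demo_vcf <- function() {
  path <- tempfile(fileext = ".vcf")
  writeLines(c("#not wanted line",
               "#unnecessary line",
               "#junk line",
               "#CHROM\tPOS\tID\tREF\tALT",
               "11\t33443\t3\tA\tT",
               "12\t33445\t5\tA\tG"), path)
  path
}

vcf_files <- c(make_demo_vcf(), make_demo_vcf())

# one data frame per file, named after the file it came from
vcf_list <- lapply(vcf_files, read_vcf_table)
names(vcf_list) <- basename(vcf_files)
```

With real data, `vcf_files` would more likely come from something like `list.files(pattern = "\\.vcf$", full.names = TRUE)`. Note that read.table guesses column types, so a column containing only "T" or "F" could be read as logical; passing `colClasses` is one way to guard against that.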

Best, Robert

Ricardo Guerreiro
  • Great answer, but do you always use dots in your variable names? I find it confusing (especially if you also know Python) and much prefer underscores. I guess it's a matter of taste, though. Cheers. – Ricardo Guerreiro Dec 21 '18 at 09:01
  • @RicardoGuerreiro dots are idiomatic in variable names in R. Widely used and perfectly acceptable. – Calimo Dec 21 '18 at 12:17
6

data.table::fread reads the file as intended; see this example:

library(data.table)

#try this example vcf from GitHub
vcf <- fread("https://raw.githubusercontent.com/vcflib/vcflib/master/samples/sample.vcf")

#or if the file is local:
vcf <- fread("path/to/my/vcf/sample.vcf")

We can also use the vcfR package; see its manuals for details.

zx8754
1

I don't know how fread manages to read the VCF correctly in the answer above, but you can use the 'skip' argument to name a string that marks the line where reading should start (or, if it is an integer, the number of rows to skip).

library(data.table)
df = fread(file='some.vcf', sep='\t', header = TRUE, skip = '#CHROM')
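
To try the `skip` approach without a real VCF at hand, here is a small sketch that writes the sample rows from the question to a temporary file and reads them back. The temporary file and the name `vcf_path` are assumptions made for the demo, and the columns are assumed to be tab-separated.

```r
library(data.table)

# Assumed demo input: the sample rows from the question, tab-separated
vcf_path <- tempfile(fileext = ".vcf")
writeLines(c("#not wanted line",
             "#unnecessary line",
             "#junk line",
             "#CHROM\tPOS\tID\tREF\tALT",
             "11\t33443\t3\tA\tT",
             "12\t33445\t5\tA\tG"), vcf_path)

# skip = "#CHROM" makes fread start at the first line containing that string,
# and header = TRUE uses that line for the column names
vcf <- fread(file = vcf_path, sep = "\t", header = TRUE, skip = "#CHROM")

names(vcf)  # the first column keeps its leading "#", i.e. "#CHROM"
nrow(vcf)   # number of variant rows read
```

If the leading "#" in the first column name gets in the way, `setnames(vcf, "#CHROM", "CHROM")` is one way to rename it.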
Jedi Knight