0

I have a VCF file with 200 samples (mitochondrial genome of Plasmodium falciparum). I managed to transform the raw data into Pandas dataframe. Here is a pic to take a look at:

enter image description here

And a few relevant lines from the actual file:

##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT 
allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in 
the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called 
genotypes">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  FP0008-C   
FP0009-C
Pf_M76611       19      .       A       T       33196.8 MissingVQSLOD;Mitochondrion 
AC=9;AF=0.0003019;AN=29816;ANN=T|intragenic_variant|MODIFIER|PF3D7_MIT00100|PF3D7_MIT00100|gene_variant|PF3D7_MIT00100|||n.19A>T||||||;AS_InbreedingCoeff=0.0437;AS_QD=20.96;BaseQRankSum=0.141;CDS;DP=559138;ExcessHet=0.0026;FS=2.892;InbreedingCoeff=0.52;MLEAC=8;MLEAF=0.0002683;MQ=60;MQRankSum=0;QD=20.96;ReadPosRankSum=0.856;RegionType=Mitochondrion;SOR=0.789;set=genotypegvcfs  GT:AD:DP:GQ:PL  0/0:52,0:52:99:0,120,1800  0/0:54,0:54:99:0,120,1800   

A small subset of samples I am working on can be found here: https://www.mediafire.com/file/6dkct6guk5zx83m/sample.vcf/file

I am trying to calculate allele frequency (AF) for each variant in the dataset. The population is assumed to be all the samples in the file.

In other words, even though the VCF file contains AF statistics from population data, I would like to know how to recalculate AF only from the VCF file. I believe I need to know how AC and AN are calculated first.

AF is calculated this way: allele count (AC) / allele number (AN)

I tried to calculate them from the information from GT (counting 0/0, 1/1 and 0/1 in all the 200 samples), but then I learned that this calculation is not suitable for mithocondrial data.

All lines in the VCF file contain an AF statistic:

$ grep -v 'AF=' sample.vcf | grep -v '^#'
<No output>

The information is provided under INFO column, but how were they calculated? If I do not have AC and AN to begin with, what should I do? Is there a way to calculate AC and AN from the other information provided in the dataset? such as the data mentioned in the FORMAT column? GT: genotype AD: unfiltered allele depth DP : read depth at this position for this sample (Integer) GQ : conditional genotype quality, encoded as a phred quality PL : the phred-scaled genotype likelihoods rounded to the closest integer

I tried to figure out a way using PyVCF and VCFtoolz (not VCFtools) for Python, but i could not figure anything out. Is there any way possible to do it in Python?

Thanks

eh329
  • 94
  • 10
  • Refrain from showing your dataframe as an image. Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. – itprorh66 Jan 28 '23 at 14:06

1 Answers1

1

I have been in same situation before. My advice is instead of using the whole VCF file into pandas dataframe and then struggling with it, Use GATK VariantsToTable tool to make your VCF into an indexed CSV file. The commands are fairly easy depending on the fileds inside the VCF you want. Once that is handled you can go into analysis with slinging pandas and numpy all the way!

The tool can be installed via their github repo: GATK-4-github

or by using Conda (I use it this way): GATK-4-conda

Once you are done with that go for: User-Guide

and You'll be good to go! all the best!

Zer0day
  • 11
  • 1