2
vcftools --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf --weir-fst-pop POP1.txt --weir-fst-pop POP2.txt --out fst.POP1.POP2

The above script computes Fst distances on 1000 Genomes population data using Weir and Cokerham's 1984 formula. This formula uses 3 variance components, namely a,b,c (between populations; between individuals within populations; between gametes within individuals within populations).

The output directly provides the result of the formula but not the components that the program calculated to arrive at the final result. How can I ask Vcftools to output the values for a,b,c?

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
Davide Piffer
  • 113
  • 2
  • 12

1 Answers1

4

If you can get the data into the format for hierfstat, you can get the variance components from varcomp.glob. What I normally do is:

  1. use vcftools with --012 to get genotypes
  2. convert 0/1/2/-1 to hierfstat format (eg., 11/12/22/NA)
  3. load the data into hierfstat and compute (see below)

R example:

library(hierfstat)
data = read.table("hierfstat.txt", header=T, sep="\t")
levels = data.frame(data$popid)
loci = data[,2:ncol(data)]
res = varcomp.glob(levels=levels, loci=loci, diploid=T)
print(res$loc)
print(res$F)

Fst for each locus (row) therefore is (without hierarchical design), from res$loc: res$loc[1]/sum(res$loc). If you have more complicated sampling, you'll need to interpret the variance components differently.

--update per your comment--

I do this in Pandas, but any language would do. It's a text replacement exercise. Just get your .012 file into a dataframe and convert as below. I read in row by row into numpy b/c I have tons of snps, but read_csv would work, too.

import pandas as pd
import numpy as np
z12_data = []
for i, line in enumerate(open(z12_file)):
    line = line.strip()
    line = [int(x) for x in line.split("\t")]
    z12_data.append(np.array(line))
    if i % 10 == 0:
        print i
z12_data = np.array(z12_data)
z12_df = pd.DataFrame(z12_data)
z12_df = z12_df.drop(0, axis=1)
z12_df.columns = pd.Series(z12_df.columns)-1

hierf_trans = {0:11, 1:12, 2:22, -1:'NA'}
def apply_hierf_trans(series):
    return [hierf_trans[x] if x in hierf_trans else x for x in series]
hierf = df.apply(apply_hierf_trans)
hierf.to_csv("hierfstat.txt", header=True, index=False, sep="\t")

Then, you'd read that file hierfstat.txt into R, these are your loci. You'd need to specify your levels in your sampling design (e.g., your population). Then call varcomp.glob() to get the variance components. I have a parallel version of this here if you want to use it.

Note that you are specifying 0 as the reference allele, in this case. May be what you want, maybe not. I often calculate minor allele frequency and make 2 the minor allele, but it depends on your study goal.

Chris F.
  • 773
  • 6
  • 15
  • This is a very useful answer. I have carried now step 1. However, I do not know how to do step 2 (convert 0/1/2/-1 to hierfstat format (eg., 11/12/22/NA)). – Davide Piffer May 16 '15 at 09:51