How to combine all chromosomes in a single file

Question

I downloaded the 1000 Genomes data (chromosomes 1-22), which is in VCF format. How can I combine all chromosomes into a single file? Should I first convert each chromosome into PLINK binary files and then merge them with --bmerge / --merge-list? Or is there another way to combine them? Any suggestions, please?

Any good reason to combine them? — zx8754, Apr 04 '18 at 20:18

Vince · Answer 1 · 2018-04-05T13:38:18.070

You could use PLINK as you suggest. You can also use BCFtools:

https://samtools.github.io/bcftools/bcftools.html

Specifically, the concat command:

bcftools concat ALL.chr{1..22}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -Oz -o ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
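
If brace expansion is not available in your shell, or the inputs are easier to manage as a list, a roughly equivalent sketch passes a file list to bcftools concat with -f and then indexes the result. The list name vcf_files.txt is a placeholder, and the script assumes the per-chromosome files sit in the current directory:

```shell
set -eu

prefix="ALL.chr"
suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
list="vcf_files.txt"                 # placeholder name, not from the answer
out="ALL.autosomes${suffix}"

# Write the inputs in genomic order (1..22), one path per line.
: > "$list"
for chrom in $(seq 1 22); do
    printf '%s%s%s\n' "$prefix" "$chrom" "$suffix" >> "$list"
done

# Run only when bcftools and the first input are actually present.
if command -v bcftools >/dev/null 2>&1 && [ -e "${prefix}1${suffix}" ]; then
    bcftools concat -f "$list" -Oz -o "$out"
    bcftools index -t "$out"         # tabix index for the combined file
else
    echo "bcftools or input VCFs not found; only wrote $list" >&2
fi
```

Keeping the order in a file also makes it easy to check that the chromosomes are listed 1 to 22 before concatenating.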

If you use PLINK, you will likely run into problems with 1000 Genomes because it contains multi-allelic SNPs, which are not compatible with PLINK. There are also SNPs that share the same RS identifier, which is not compatible with PLINK either.

To get PLINK to work, you will need to resolve these issues by splitting multi-allelic SNPs into multiple records and removing records with duplicate RS identifiers (or assigning new unique identifiers).

Moreover, PLINK binary PED does not support genotype probabilities. I do not recall if 1000 Genomes includes this type of information. If it does and you want to retain it, you cannot convert it to binary PED as the genotype probabilities will be hard-called, see:

https://www.cog-genomics.org/plink2/input

Specifically:

Since the PLINK 1 binary format cannot represent genotype probabilities, calls with uncertainty greater than 0.1 are normally treated as missing, and the rest are treated as hard calls.

So, if you plan to retain VCF format for the output, I recommend against using PLINK.

EDIT

Here is a method to convert VCF to PLINK:

To build PLINK-compatible files from the VCF files, duplicate positions and SNP IDs need to be merged or removed. Here I opt to remove all duplicate entries. First, catalogue the duplicate SNP IDs:

grep -v '^#' <(zcat ALL.chr${chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz) | cut -f 3 | sort | uniq -d > ${chrom}.dups

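Using BCFtools, normalize multi-allelic SNPs and drop duplicate records, then use PLINK to remove the duplicate SNP IDs found in the previous step. Note that -m +any joins records at the same position into a single multi-allelic record; use -m -any instead if you want to split multi-allelic records into biallelic ones: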

bcftools norm -d both -m +any -Ob ALL.chr${chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | plink --bcf /dev/stdin --make-bed --out ${chrom} --allow-extra-chr 0 --memory 6000 --exclude ${chrom}.dups
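
The two commands above leave ${chrom} undefined. A minimal sketch of a loop that sets it for chromosomes 1 to 22 might look like the following; the function name convert_chrom is hypothetical, gzip -dc stands in for zcat, and chromosomes whose VCF is missing are skipped:

```shell
set -eu

suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

# Run the two commands above for one chromosome ($1 = chromosome number).
convert_chrom() {
    vcf="ALL.chr${1}${suffix}"

    # Catalogue variant IDs that appear more than once.
    gzip -dc "$vcf" | grep -v '^#' | cut -f 3 | sort | uniq -d > "${1}.dups"

    # Normalise with bcftools, then write a PLINK binary fileset.
    bcftools norm -d both -m +any -Ob "$vcf" |
        plink --bcf /dev/stdin --make-bed --out "${1}" \
              --allow-extra-chr 0 --memory 6000 --exclude "${1}.dups"
}

for chrom in $(seq 1 22); do
    if [ -e "ALL.chr${chrom}${suffix}" ]; then
        convert_chrom "$chrom"
    else
        echo "skipping chromosome ${chrom}: input VCF not found" >&2
    fi
done
```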

Importantly, this is not the only way to resolve the issues in converting VCF to PLINK. For instance, you can assign unique identifiers to variants with duplicate RS IDs.
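
As one illustration of that alternative, bcftools annotate can overwrite the ID column with an identifier built from each record's own fields. This is a sketch rather than the method used above: the CHROM:POS:REF:ALT pattern, the function name, and the output file name are all assumptions, and IDs built this way are only unique once each record carries a single ALT allele.

```shell
set -eu

suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

# Rewrite every ID as CHROM:POS:REF:ALT ($1 = chromosome number).
# The ".uniqid.vcf.gz" output name is hypothetical.
assign_unique_ids() {
    bcftools annotate --set-id '%CHROM:%POS:%REF:%ALT' \
        -Oz -o "chr${1}.uniqid.vcf.gz" "ALL.chr${1}${suffix}"
}

chrom=22   # example chromosome
if command -v bcftools >/dev/null 2>&1 && [ -e "ALL.chr${chrom}${suffix}" ]; then
    assign_unique_ids "$chrom"
else
    echo "bcftools or input VCF not found; nothing done" >&2
fi
```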

Should you decide to go with PLINK, I am happy to edit this answer with BCFtools and PLINK commands that address the limitations above. — Vince, Apr 05 '18 at 00:20
So far I have tried PLINK and got the "multiallelic position" errors. You are right, Vince. I am aiming to prune the data with these conditions: --maf 0.01 --snps-only --indep-pairwise 50 10 0.2. The desired output is a single file (preferably in PLINK format). Can I do this with bcftools alone? — bha, Apr 05 '18 at 13:23
I edited the answer to include the method I used to convert VCF to PLINK. After this you can use `--merge-list`, as you specified, to merge the chromosomes into one file. — Vince, Apr 05 '18 at 13:45
I am not sure whether PLINK 1.9 has an option to accept multi-allelic sites. If I recall correctly, it just exits with an error. PLINK 2.0 seems to remove them, so you can likely use that to read the VCF: https://www.cog-genomics.org/plink/2.0/input#vcf. — Vince, Apr 05 '18 at 13:48
Many thanks for the detailed answer. It looks like PLINK 1.9 can do most of this. I used bcftools concat to combine all 22 VCF files, then pruned with the desired MAF and LD thresholds using PLINK 1.9, since PLINK accepts VCF input. I was expecting some multi-allelic positions in the output file, but I could not spot any. It looks to me as though they were pruned along with everything else when I applied the LD and MAF filters. Do you think this is the correct way, or must I run the two commands above first? — bha, Apr 07 '18 at 16:23
I did not use --merge-list, because bcftools concat had already combined all chromosomes. — bha, Apr 07 '18 at 16:25
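
For the PLINK route discussed in these comments, the merge and pruning steps could be sketched as follows. It assumes per-chromosome binary filesets named 1 to 22 already exist (as produced with --out ${chrom} above); the names merge_list.txt, autosomes, autosomes.ld, and autosomes.pruned are placeholders:

```shell
set -eu

merge_list="merge_list.txt"   # placeholder name

# Chromosome 1 is the reference fileset, so list prefixes 2..22 only.
: > "$merge_list"
for chrom in $(seq 2 22); do
    echo "$chrom" >> "$merge_list"
done

if command -v plink >/dev/null 2>&1 && [ -e 1.bed ]; then
    # Merge the per-chromosome filesets into one.
    plink --bfile 1 --merge-list "$merge_list" --make-bed --out autosomes

    # Compute the LD-pruned SNP list with the thresholds from the comment.
    plink --bfile autosomes --maf 0.01 --snps-only \
          --indep-pairwise 50 10 0.2 --out autosomes.ld

    # Keep only the SNPs that survived pruning.
    plink --bfile autosomes --extract autosomes.ld.prune.in \
          --make-bed --out autosomes.pruned
else
    echo "plink or per-chromosome filesets not found; only wrote $merge_list" >&2
fi
```

Note that --indep-pairwise only writes the lists of retained and removed variants; the separate --extract step is what produces the pruned fileset.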

score 2 · Answer 2 · answered Apr 04 '18 at 11:53

picard GatherVcfs https://broadinstitute.github.io/picard/command-line-overview.html

Gathers multiple VCF files from a scatter operation into a single VCF file. Input files must be supplied in genomic order and must not have events at overlapping positions.
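
A possible invocation for the 1000 Genomes files is sketched below. The jar path picard.jar is an assumption about the local setup, and the inputs are passed in chromosome order because of the genomic-order requirement quoted above:

```shell
set -eu

suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
picard_jar="picard.jar"       # assumed location of the Picard jar

# Collect one I= argument per chromosome, in genomic order.
set --
for chrom in $(seq 1 22); do
    set -- "$@" "I=ALL.chr${chrom}${suffix}"
done

if [ -e "$picard_jar" ] && [ -e "ALL.chr1${suffix}" ]; then
    java -jar "$picard_jar" GatherVcfs "$@" O="ALL.autosomes${suffix}"
else
    echo "Picard jar or input VCFs not found; would pass $# inputs" >&2
fi
```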
