I downloaded the 1000 Genomes data (chromosomes 1-22), which is in VCF format. How can I combine all chromosomes into a single file? Should I first convert each chromosome to PLINK binary files and then merge them with --bmerge/--merge-list? Or is there another way to combine them? Any suggestions, please?

-
Any good reason to combine them? – zx8754 Apr 04 '18 at 20:18
2 Answers
You could use PLINK as you suggest. You can also use BCFtools:
https://samtools.github.io/bcftools/bcftools.html
Specifically, the `concat` command:
bcftools concat ALL.chr{1..22}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -Oz -o ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
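The same concatenation can also be driven from a file list, which makes the genomic order of the inputs explicit and lets you check that every file exists first. This is only a sketch: the helper names `list_autosome_vcfs` and `concat_autosomes`, the list file `vcf_list.txt`, and the shortened output name are assumptions for illustration.

```shell
# Hypothetical helper: print the 22 autosome VCF names in genomic order.
list_autosome_vcfs() {
    local chrom
    for chrom in $(seq 1 22); do
        echo "ALL.chr${chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
    done
}

# Hypothetical helper: concatenate the listed files and index the result.
concat_autosomes() {
    local out="$1" f
    list_autosome_vcfs > vcf_list.txt
    # Stop early if any input file is missing
    while read -r f; do
        [ -e "$f" ] || { echo "missing input: $f" >&2; return 1; }
    done < vcf_list.txt
    bcftools concat --file-list vcf_list.txt -Oz -o "$out"
    bcftools index --tbi "$out"
}

# Usage (not run automatically):
# concat_autosomes ALL.autosomes.vcf.gz
```

Writing the list to a file first also leaves a record of exactly which inputs went into the combined VCF.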
If you use PLINK, you will likely run into problems with 1000 Genomes because it contains multi-allelic SNPs, which are not compatible with PLINK. There are also SNPs that share the same RS identifier, which PLINK does not accept either.
To get PLINK to work, you will need to resolve both issues: split multi-allelic SNPs into multiple records, and remove records with duplicate RS identifiers (or assign them new, unique identifiers).
Moreover, PLINK binary PED does not support genotype probabilities. I do not recall if 1000 Genomes includes this type of information. If it does and you want to retain it, you cannot convert it to binary PED as the genotype probabilities will be hard-called, see:
https://www.cog-genomics.org/plink2/input
Specifically:
Since the PLINK 1 binary format cannot represent genotype probabilities, calls with uncertainty greater than 0.1 are normally treated as missing, and the rest are treated as hard calls.
So, if you plan to retain VCF format for the output, I recommend against using PLINK.
EDIT
Here is a method to convert VCF to PLINK:
To build PLINK-compatible files from the VCF files, duplicate positions and SNP IDs need to be merged or removed. Here I opt to remove all duplicate entries. First, catalogue the duplicate SNP IDs (with `chrom` set to the chromosome number):
grep -v '^#' <(zcat ALL.chr${chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz) | cut -f 3 | sort | uniq -d > ${chrom}.dups
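To see what that pipeline produces, here is a small sketch on toy data. The file names `toy.vcf` and `toy.dups` and the records themselves are made up for illustration, and only the first five VCF columns are included because the ID check needs nothing else.

```shell
# Build a tiny tab-separated VCF (toy data, not from 1000 Genomes).
printf '##fileformat=VCFv4.1\n'        > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\n'  >> toy.vcf
printf '1\t100\trs1\tA\tG\n'          >> toy.vcf
printf '1\t200\trs2\tC\tT\n'          >> toy.vcf
printf '1\t200\trs2\tC\tG\n'          >> toy.vcf
printf '1\t300\trs3\tG\tA\n'          >> toy.vcf

# Same steps as the command above: drop header lines, keep column 3 (ID),
# sort, and print each ID that occurs more than once.
grep -v '^#' toy.vcf | cut -f 3 | sort | uniq -d > toy.dups

cat toy.dups   # prints: rs2
```

Note that `uniq -d` lists each duplicated ID once, so excluding that ID later removes every record carrying it, not just the extra copies.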
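Using BCFtools, normalize multi-allelic sites and drop duplicate records (`-d both`), then use PLINK to exclude the duplicate SNP IDs found in the previous step. Note that `-m +any`, as written here, joins records at the same position into a single multi-allelic record; to split multi-allelic sites into separate biallelic records instead, use `-m -any`: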
bcftools norm -d both -m +any -Ob ALL.chr${chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | plink --bcf /dev/stdin --make-bed --out ${chrom} --allow-extra-chr 0 --memory 6000 --exclude ${chrom}.dups
Importantly, this is not the only way to resolve the issues in converting VCF to PLINK. For instance, you can assign unique identifiers to records with duplicate RS IDs instead of removing them.
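As one possible sketch of the "unique identifier" alternative, the following rewrites only the duplicated IDs as `ID:REF:ALT` using plain awk on an uncompressed toy file. The naming scheme, the file names, and the toy records are assumptions for illustration, not something the answer prescribes.

```shell
# Toy input (hypothetical data): two records share the ID rs2.
printf '#CHROM\tPOS\tID\tREF\tALT\n'  > toy_ids.vcf
printf '1\t100\trs1\tA\tG\n'         >> toy_ids.vcf
printf '1\t200\trs2\tC\tT\n'         >> toy_ids.vcf
printf '1\t200\trs2\tC\tG\n'         >> toy_ids.vcf

# The file is read twice. Pass 1 counts each ID; pass 2 prints the header
# unchanged and rewrites only IDs seen more than once as ID:REF:ALT.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/          { if (NR != FNR) print; next }
     NR == FNR     { count[$3]++; next }
     count[$3] > 1 { $3 = $3 ":" $4 ":" $5 }
                   { print }' toy_ids.vcf toy_ids.vcf > toy_ids.unique.vcf

cut -f 3 toy_ids.unique.vcf
```

This scheme still leaves collisions when two records agree in ID, REF, and ALT, so exact duplicate records would need to be removed first. BCFtools also has `annotate --set-id`, which builds IDs from a format string and may be a better fit for compressed files.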

-
Should you decide to go with PLINK, I am happy to edit this answer with BCFtools and PLINK commands that address the limitations above. – Vince Apr 05 '18 at 00:20
-
So far I tried PLINK and got "multiallelic position" errors. You are right, Vince. I am aiming to prune the data with these conditions: --maf 0.01 --snps-only --indep-pairwise 50 10 0.2. The desired output is a single file (preferably in PLINK format). Can I do this with bcftools alone? – bha Apr 05 '18 at 13:23
-
I edited the answer to include the method I used to convert VCF to PLINK. After this you can use `--merge-list`, as you specified, to merge the chromosomes into one file. – Vince Apr 05 '18 at 13:45
-
I am not sure whether PLINK 1.9 has an option to accept multi-allelic sites. If I recall correctly, it just exits with an error. PLINK 2.0 seems to remove them, so you can likely use that to read the VCF: https://www.cog-genomics.org/plink/2.0/input#vcf. – Vince Apr 05 '18 at 13:48
-
Many thanks for the detailed answer. It looks like PLINK 1.9 can do most of this. I used bcftools concat to combine all 22 VCF files, then pruned with the desired MAF and LD settings using PLINK 1.9, since PLINK accepts VCF input. I was expecting some multi-position variants in the output file, but I could not spot any. It looks to me as though pruning the VCF with the desired LD and MAF settings also removed the multi-position variants. Do you think this is the correct way, or must I run the two script lines above first? – bha Apr 07 '18 at 16:23
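The merge and pruning steps discussed in these comments could be sketched as follows. It assumes the per-chromosome filesets are named `1` through `22` (the `--out ${chrom}` prefixes from the answer above); the output prefix `autosomes`, the file `merge_list.txt`, and the helper name `merge_and_prune` are hypothetical.

```shell
# One fileset prefix per line. Chromosomes 2-22 go in the list;
# chromosome 1 is supplied separately via --bfile.
seq 2 22 > merge_list.txt

merge_and_prune() {
    # Merge all autosomes into a single binary fileset
    plink --bfile 1 --merge-list merge_list.txt --make-bed --out autosomes
    # Apply the filters from the comment; writes autosomes.prune.in / .prune.out
    plink --bfile autosomes --maf 0.01 --snps-only \
          --indep-pairwise 50 10 0.2 --out autosomes
    # Keep only the variants retained after filtering and LD pruning
    plink --bfile autosomes --extract autosomes.prune.in \
          --make-bed --out autosomes.pruned
}

# Usage (not run automatically):
# merge_and_prune
```

`--indep-pairwise` only reports which variants to keep, so the final `--extract` step is what actually produces the pruned fileset.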
-
picard GatherVcfs https://broadinstitute.github.io/picard/command-line-overview.html
Gathers multiple VCF files from a scatter operation into a single VCF file. Input files must be supplied in genomic order and must not have events at overlapping positions.
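A minimal sketch of how this might be invoked for the 22 autosomes, passing the inputs in genomic order. The helper name `gather_inputs`, the location of `picard.jar`, and the output file name are assumptions.

```shell
# Hypothetical helper: print one I=<file> argument per autosome, in genomic order.
gather_inputs() {
    local chrom
    for chrom in $(seq 1 22); do
        printf 'I=ALL.chr%s.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz\n' "$chrom"
    done
}

# Usage (not run automatically; assumes picard.jar is in the current directory):
# java -jar picard.jar GatherVcfs $(gather_inputs) O=ALL.autosomes.vcf.gz
```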
