How do I remove duplicated SNPs using PLink?

Question

I am working with PLINK to analyse genome-wide data.

Does anyone know how to remove duplicated SNPs?

Wouldn't that be --exclude duplicateSNPs.txt? Check out the unix utility uniq for a solution to your problem. — tommy.carstensen, Oct 12 '12 at 08:28
[Should the plink tag be used for the genome tool or for the PuTTY (SSH) command line tool](http://meta.stackexchange.com/q/178289/146482) — Tobias Kienzler, Apr 29 '13 at 07:13

score 4 · Answer 1 · edited May 06 '17 at 11:44

In PLINK 1.9, use --list-duplicate-vars suppress-first, which lists the duplicates but leaves the first variant of each group off the list, so that excluding the listed variants removes the extra copies and keeps one intact. I've known this to slip up, though.
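As a rough sketch of how that flag might be combined with --exclude: the prefixes mydata and mydata_nodup and the DRY_RUN switch below are placeholders for illustration, not part of PLINK. The ids-only modifier makes the report a plain list of IDs that --exclude can read. Note that --list-duplicate-vars groups variants by position and allele codes, and --exclude works by ID, so if the copies also share one ID, every copy would be dropped.

```shell
#!/usr/bin/env bash
# Sketch: drop all but the first variant of each duplicate group (PLINK 1.9).
# "mydata" and "mydata_nodup" are hypothetical fileset prefixes.
set -eu

dedup_plink19() {
  local in_prefix=$1 out_prefix=$2
  # With DRY_RUN set, the commands are printed instead of executed.
  local run=${DRY_RUN:+echo}

  # Step 1: write the IDs of duplicated variants to <in_prefix>_dups.dupvar;
  # suppress-first leaves the first member of each group off the list.
  $run plink --bfile "$in_prefix" --list-duplicate-vars ids-only suppress-first --out "${in_prefix}_dups"

  # Step 2: exclude the listed IDs and write a new binary fileset.
  $run plink --bfile "$in_prefix" --exclude "${in_prefix}_dups.dupvar" --make-bed --out "$out_prefix"
}

# Example: show the two commands without needing PLINK installed.
DRY_RUN=1 dedup_plink19 mydata mydata_nodup
```

Dropping DRY_RUN=1 would run the two PLINK commands for real, assuming plink is on the PATH and mydata.bed/.bim/.fam exist.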

Instead of using --exclude as Davy suggested, you can also use --extract, which keeps a list of SNPs rather than getting rid of one. There's an easy method on any Unix-based system (assuming your data are in PED/MAP format and split by chromosome):

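for i in {1..22}; do
  # MAP columns: chromosome, SNP ID, genetic distance, bp position.
  # Keep the ID of the first SNP seen at each chromosome/position pair.
  awk '!seen[$1, $4]++ { print $2 }' yourfile_chr${i}.map > keepers_chr${i}.txt
done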

This will create a keepers_chr${i}.txt file for each chromosome, containing the IDs of SNPs at unique locations. Then run PLINK on your original file(s) with --extract keepers_chr${i}.txt, together with --make-bed --out unique_file_chr${i}
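To make the idea concrete, here is a small self-contained sketch on a made-up MAP file. The file names and SNP IDs are invented for illustration, and awk is used here as one possible way to keep a single SNP per position.

```shell
set -eu
workdir=$(mktemp -d)

# Toy MAP file (hypothetical data): chromosome, SNP ID, genetic distance, bp position.
# rs2 and rs2b sit at the same position.
printf '1\trs1\t0\t1000\n1\trs2\t0\t2000\n1\trs2b\t0\t2000\n1\trs3\t0\t3000\n' > "$workdir/toy_chr1.map"

# Print the ID of the first SNP seen at each chromosome/position pair.
awk '!seen[$1, $4]++ { print $2 }' "$workdir/toy_chr1.map" > "$workdir/keepers_chr1.txt"

cat "$workdir/keepers_chr1.txt"

# The list would then be passed to PLINK, for example:
#   plink --file toy_chr1 --extract keepers_chr1.txt --make-bed --out unique_chr1
```

On this toy input the keepers file lists rs1, rs2 and rs3, so rs2b would be left out by --extract.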

score 2 · Answer 2 · answered Jun 22 '12 at 09:13

There is no command that I am aware of to do it automatically. The way I have done it in the past is to get a list of the SNPs that are duplicated and rename the duplicates (rs1001 becomes rs1001.dup, for example) using --update-map with --update-name. Then create a list of the renamed duplicates, so that every entry has .dup at the end of its name, and run --exclude duplicateSNPs.txt --make-bed --out yourfilename.dups.removed

Getting a list of SNPs that are duplicated shouldn't be too hard if you are familiar with R. Sorry to give you a "well just learn X!!!" answer.
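For anyone who would rather stay in the shell than use R, the renaming and listing steps might look roughly like the sketch below. It is only an illustration on a made-up MAP file: it edits the ID column directly instead of going through PLINK's renaming options, and the file names are invented.

```shell
set -eu
dupdir=$(mktemp -d)

# Toy MAP file (hypothetical data) in which rs1001 appears twice.
printf '1\trs1000\t0\t1000\n1\trs1001\t0\t2000\n1\trs1001\t0\t2000\n1\trs1002\t0\t3000\n' > "$dupdir/toy.map"

# Append ".dup" to the ID (column 2) of every repeat occurrence; the first one keeps its name.
awk 'BEGIN { OFS = "\t" } seen[$2]++ { $2 = $2 ".dup" } { print }' "$dupdir/toy.map" > "$dupdir/renamed.map"

# Collect the renamed IDs into the list that would be handed to PLINK.
awk '$2 ~ /\.dup$/ { print $2 }' "$dupdir/renamed.map" > "$dupdir/duplicateSNPs.txt"

cat "$dupdir/duplicateSNPs.txt"
```

Because the row order is unchanged, the renamed MAP file still lines up with the original PED file. An ID that occurs three or more times would get the same .dup name more than once, which is fine for removal by ID but worth knowing.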

score 2 · Answer 3 · answered Aug 15 '20 at 04:13

A couple of other ideas that might be of help/interest:

You can also remove duplicates from a VCF using bcftools, with the command bcftools norm -D (long form --remove-duplicates). The bcftools documentation can be found at https://samtools.github.io/bcftools/bcftools.html
In the spirit of also just using Unix to remove duplicates, I've previously used the following (input is a compressed VCF file): gunzip -c input.vcf.gz | grep -v "^#" | cut -f3 | sort | uniq -d > plink.dupvar. Here plink.dupvar is the filename PLINK uses for its list of duplicated variants in the duplicate removal step.
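A small runnable sketch of that pipeline on a made-up file follows. The records are invented and truncated to five columns, so this is not a valid VCF. One addition that is my assumption rather than part of the answer: missing IDs (".") are filtered out, since otherwise every unnamed variant would be reported as a duplicate of the others.

```shell
set -eu
vcfdir=$(mktemp -d)

# Toy VCF-like input (hypothetical records); only the first five columns are filled in.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\n'
  printf '1\t1000\trs1\tA\tG\n'
  printf '1\t2000\trs2\tC\tT\n'
  printf '1\t2000\trs2\tC\tT\n'
  printf '1\t3000\t.\tG\tA\n'
  printf '1\t4000\t.\tT\tC\n'
} | gzip -c > "$vcfdir/input.vcf.gz"

# Skip header lines, take the ID column, drop missing IDs ("."),
# then print each ID that occurs more than once.
gunzip -c "$vcfdir/input.vcf.gz" | grep -v '^#' | cut -f3 | grep -v '^\.$' | sort | uniq -d > "$vcfdir/plink.dupvar"

cat "$vcfdir/plink.dupvar"
```

Here only rs2 ends up in the list. Passing such a list to --exclude removes every variant with that ID, including the first copy.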

score 0 · Answer 4 · edited Jun 25 '15 at 16:09

It is easier with R, although you have to use a TPED file. Once you manage to get a TPED file, just copy and paste this into an R console:

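# Read every column as text so that allele codes such as "T" are not converted to TRUE
a = read.table("yourfile.TPED", sep = " ", header = FALSE, colClasses = "character")
# Keep only the first row for each SNP ID (column 2 of a TPED file)
b = a[!duplicated(a$V2), ]
write.table(b, file = "newfile.TPED", sep = " ", quote = FALSE, col.names = FALSE, row.names = FALSE)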

The newfile.TPED without duplicates will appear in the R working directory. HINT: replace yourfile.TPED and newfile.TPED in the script with the actual names of your files.
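If R is not at hand, the same first-occurrence filter can probably be done with awk. This is a sketch on made-up data, with invented file names, assuming a whitespace-delimited TPED whose second column is the SNP ID.

```shell
set -eu
tpeddir=$(mktemp -d)

# Toy TPED (hypothetical data): chromosome, SNP ID, genetic distance, bp position, then genotypes.
printf '1 rs1 0 1000 A A A G\n1 rs2 0 2000 C C C T\n1 rs2 0 2000 C C C T\n1 rs3 0 3000 G G G G\n' > "$tpeddir/yourfile.tped"

# Print a line only the first time its SNP ID (column 2) is seen.
awk '!seen[$2]++' "$tpeddir/yourfile.tped" > "$tpeddir/newfile.tped"

cat "$tpeddir/newfile.tped"
```

Lines are copied through unchanged, so the allele columns are never reinterpreted along the way.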
