5

I am working with PLINK to analyse genome-wide data.

Does anyone know how to remove duplicated SNPs?

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
user1236418
  • 163
  • 2
  • 2
  • 9

4 Answers4

4

In PLINK 1.9, use --list-duplicate-vars suppress-first, which will list duplicates, and remove one (the first one), leaving the other intact. I've know this to slip up though.

Instead of using --exclude as Davy suggested, you can also use --extract, keeping rather than getting rid of a list of SNPs. There's an easy method on any Unix based system (assuming your data are in PED/MAP format and cut up by chromossome):

for i in {1..22}; do
  cat yourfile_chr${i}.map | grep "$i" | cut -f -4 | uniq | cut -f -2 | keepers_chr${i}.txt;
done

This will create a keepers_chr.txt file with SNP IDs for SNPs at unique locations. Then run PLINK feeding it your original file(s) and use --extract keepers_chr, with --make-bed --out unique_file

Kevin L. Keys
  • 997
  • 8
  • 21
Benjamatic
  • 41
  • 1
2

There is no command to do it automatically that I am aware of, but the way I have done it in the past is to get a list of SNPs that are duplicated, change the duplicates to rs1001.dup for example, then run --update-allele --update-name and then create a list of the duplicates, so all the entries will have .dup at the end of their names, and then run --extract duplicateSNPs.txt --make-bed --out yourfilename.dups.removed

Getting a list of SNPs that are duplicated shouldn't be too hard if you are familiar with R. Sorry to give you a "well just learn X!!!" answer

Davy Kavanagh
  • 4,809
  • 9
  • 35
  • 50
2

A couple of others ideas that might be of help/interest:

  1. You can also remove vcf duplicates using bcftools with the command bcftools norm -D, --remove-duplicates bcftools documentation can be found at https://samtools.github.io/bcftools/bcftools.html

  2. In the spirit of also just using Unix to remove duplicates, I've previously used the following (input is a compressed vcf file) gunzip -c input.vcf.gz | grep "^[^##]" | cut -f3 | sort | uniq -d > plink.dupvar plink.dupvar is the filename the PLINK program looks for when performing the duplication removal step.

0

With R is easier, although you have to use a TPED file. Once you manage to get a TPED file just copy and paste this into a R console:

a = read.table("yourfile.TPED",sep = " ",header=FALSE)
b = a[!duplicated(a$V2),]
write.table(b,file="newfile.TPED",sep=" ",quote = FALSE,col.names = FALSE, row.names=FALSE)

The newfile.TPED without duplicates will apper in the R working directory. HINT: you can change the yourfile.TPED and newfile.TPED part of the script for the actual name of you file.

JackWhiteIII
  • 1,388
  • 2
  • 11
  • 25