I have a text file containing sequence IDs. The file contains some duplicate IDs, and a few IDs are present more than twice. I want to write the unique IDs to one file and the repeated IDs to another file. I would also like to know how many times each repeated ID is present in the file.

I found the duplicated sequences using the following command:

$ cat id.txt | grep '^>' | sort | uniq -d > dupid.txt

This gives me the duplicated sequences in the "dupid.txt" file. But how do I find the IDs that are present more than twice, and how many times they occur? And secondly, how do I find the unique sequences?

1 Answer

A bit of searching might have found this answer, with many suggestions on traditional uses of uniq.
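In fact, sorted input plus uniq's standard flags already cover most of this question. A minimal sketch (-u prints lines that occur exactly once, -d prints one copy of each repeated line, and -c prefixes each line with its count):

$ grep '^>' id.txt | sort | uniq -u > uniques.txt
$ grep '^>' id.txt | sort | uniq -d > dupes.txt
$ grep '^>' id.txt | sort | uniq -c > counts.txt

The counts file can then be filtered on its first field to find IDs present more than twice.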

Also, note that:

$ cat id.txt | grep '^>'

...is basically the same as:

$ grep '^>' id.txt

This is the so-called "Useless Use Of Cat".

But to your question (find unique IDs, dupes, and dupes with counts), here is an attempt using awk. It processes its stdin and writes to three output files named by the user, trying to avoid clobbering output files that already exist. It makes a single pass, but holds all input in memory before writing any output.

#!/bin/bash

# Require exactly three output file names.
[ $# -eq 3 ] || { echo "Usage: $(basename "$0") <uniqs> <dupes> <dupes_counts>" 1>&2; exit 1; }

# Refuse to clobber an existing file; otherwise pass through the
# accumulated status from the previous check.
chk() {
  [ -e "$1" ] && { echo "$1: already exists" 1>&2; return 1; }
  return $2
}

chk "$1" 0; chk "$2" $?; chk "$3" $? || exit 1

# Count each ID seen on stdin, then split the IDs across the three
# output files: IDs seen once go to u, repeated IDs go to d, and
# "count:ID" lines for the repeats go to dc.
awk -v u="$1" -v d="$2" -v dc="$3" '
  {
    idc[$0]++
  }
  END {
    for (id in idc) {
      if (idc[id] == 1) {
        print id >> u
      } else {
        print id >> d
        printf "%d:%s\n", idc[id], id >> dc
      }
    }
  }
'

Save as (for example) "doit.sh", make it executable (chmod +x doit.sh), and then invoke it via:

$ grep '^>' id.txt | ./doit.sh uniques.txt dupes.txt dupes_counts.txt
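The counts file then answers the "more than twice" part of the question directly. Assuming the count:ID format written above, a filter like:

$ awk -F: '$1 > 2' dupes_counts.txt

prints only the IDs that occur three or more times, along with their counts.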