2

I have a file in fasta format as in example below. I would like to extract entries from that file when sequence: 'CGTACG' occurs more than once.

>seq1
AAATTCCGTACGGGCCTCT
>seq2
TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG
>seq3
AATGCCAAACGTACGAACAT

In the example the output would be (as the sequence 'CGTACG' occurs twice):

>seq2
TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG
codeforester
  • 39,467
  • 16
  • 112
  • 140
fattel
  • 133
  • 6

4 Answers4

4

All you need is:

awk '/^>/{seq=$0} gsub(/CGTACG/,"&") > 1{print seq ORS $0}' file
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
3

You can use awk for this:

for file in *; do
    [[ -f "$file" ]] || continue # skip if not a regular file
    if ! awk -v seq=CGTACG '$0 ~ seq".*"seq {exit(1)}' "$file"; then
        # the file has two or more occurrences of the string on the same line, process it
        # more code
    fi
done

awk looks for the string in each file and exits 1 as soon as it finds a line that has two or more occurrences of the string. if ! test makes sure that we pick up the file only when awk has an exit code of 1.

If we looking for more than one match on different lines, then:

for file in *; do
    [[ -f "$file" ]] || continue # skip if not a regular file
    if ! awk -v seq=CGTACG '$0 ~ seq {x++; if(x>1) exit(1)}' "$file"; then
        # the file has two or more occurrences of the string on different lines, process it
        # more code
    fi
done
codeforester
  • 39,467
  • 16
  • 112
  • 140
2

Another using given string as field separator:

$ awk -F"CGTACG" '/^>/{p=$0;next} NF>2{print p ORS $0}' file

Output:

>seq2
TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG

Explained:

$ awk -F"CGTACG" '    # using the substring as field separator
/^>/ {                # buffer the seqn record for possible use if match
    b=$0
    next
} 
NF>2 {                # if field count more than 2 ie. at least 2 field separators
    print b ORS $0    # output
}' file
James Brown
  • 36,089
  • 7
  • 43
  • 59
1

This is an extension to the question: how can I count the frequency of letters

awk -v subseq="CGTACG" '
     />/ && gsub(subseq,subseq,seq) > 1 { print name; print seq }
     />/{name=$0;seq="";next}
     {seq=seq $0}
     END { if(gsub(subseq,subseq,seq) > 1) { print name; print seq } }
    ' file.fasta

This method merges all multi-line sequences in a single line and checks if subseq appears more than ones. It does this using the gsub function:

gsub(ere, repl[, in]) Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument when specified.

sub(ere, repl[, in ]) Substitute the string repl in place of the first instance of the extended regular expression ERE in string in and return the number of substitutions. <snip> If in is omitted, awk shall use the current record ($0) in its place.

source: Awk Posix Standard

This, however, can be cleaned up a bit:

awk -v subseq="CGTACG" '
     function count_subseq(seq,subseq,   t) {
         t=seq;gsub(RS,RS,t)
         return gsub(subseq,subseq,t)
     }
     />/ && count_subseq(seq,subseq) > 1 { print name; print seq }
     />/{name=$0;seq="";next}
     {seq=seq RS $0}
     END { if(count_subseq(seq,subseq) > 1) { print name; print seq } }
    ' file.fasta

Identically, using bioawk, you can do

bioawk -c fastx -v subseq="CGTACG" '(gsub(subseq,subseq,seq)>1){print ">"$name; print $seq}' file.fasta

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.

kvantour
  • 25,269
  • 4
  • 47
  • 72