This is an extension to the question: how can I count the frequency of letters
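awk -v subseq="CGTACG" '
# On each header, report the previous record if it has more than one hit
/^>/ && gsub(subseq,subseq,seq) > 1 { print name; print seq }
# Start a new record
/^>/ { name=$0; seq=""; next }
# Append each sequence line, joining the record into a single line
{ seq=seq $0 }
# Report the last record, which has no header following it
END { if (gsub(subseq,subseq,seq) > 1) { print name; print seq } }
' file.fasta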
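This method merges each multi-line sequence into a single line and checks whether subseq
appears more than once. Replacing subseq with itself leaves the sequence unchanged, so the return value of gsub is simply the number of (non-overlapping) matches. The gsub
function is specified as follows: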
gsub(ere, repl[, in])
Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument when specified.
sub(ere, repl[, in ])
Substitute the string repl in place of the first instance of the extended regular expression ERE in string in and return the number of substitutions. <snip> If in is omitted, awk shall use the current record ($0) in its place.
Source: POSIX awk standard
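As a quick illustration of the counting trick, here is a minimal sketch on two made-up strings (the sequences are illustrative, not taken from any real file). It also shows a caveat that follows from how gsub scans the string: overlapping occurrences are counted only once.

```shell
# Count matches by replacing the pattern with itself; the string is left unchanged.
# Both test sequences below are made-up examples.
count=$(awk -v subseq="CGTACG" 'BEGIN {
    s = "AACGTACGTTCGTACGAA"       # two separate copies of CGTACG
    print gsub(subseq, subseq, s)  # return value = number of replacements
}')

# CGTACGTACG contains CGTACG at positions 1 and 5, but the two copies overlap,
# so gsub resumes scanning after the first match and finds only one.
overlap=$(awk -v subseq="CGTACG" 'BEGIN {
    s = "CGTACGTACG"
    print gsub(subseq, subseq, s)
}')

echo "separate: $count, overlapping: $overlap"
```

Note that subseq is interpreted as an extended regular expression, which is harmless for plain nucleotide strings but matters if the pattern contains characters such as . or [.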
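This can be cleaned up a bit with a helper function. The version below also keeps the original line breaks in the stored sequence, so records are printed with their original layout, and removes the line breaks only from a temporary copy before counting:
awk -v subseq="CGTACG" '
# t is a local variable: a copy of seq with the line breaks removed
function count_subseq(seq,subseq,   t) {
  t=seq; gsub(RS,"",t)
  return gsub(subseq,subseq,t)
}
/^>/ && count_subseq(seq,subseq) > 1 { print name; print seq }
/^>/ { name=$0; seq=""; next }
# Keep the line breaks between sequence lines, without a leading empty line
{ seq=(seq == "" ? $0 : seq RS $0) }
END { if (count_subseq(seq,subseq) > 1) { print name; print seq } }
' file.fasta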
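When line breaks are kept in the stored sequence, a hit that straddles a line break is only found if the line breaks are removed from the copy before counting. A minimal sketch with a made-up two-line record (the sequence is an assumption for illustration):

```shell
# Made-up record: one CGTACG lies entirely on line 1, a second one is split
# across the line break (CGT at the end of line 1, ACG at the start of line 2).
result=$(awk -v subseq="CGTACG" 'BEGIN {
    seq = "AACGTACGTTCGT" RS "ACGAA"   # RS is a newline by default

    t = seq                            # copy with the line break left in
    kept = gsub(subseq, subseq, t)

    t = seq; gsub(RS, "", t)           # copy with the line break removed
    joined = gsub(subseq, subseq, t)

    print kept, joined
}')

echo "$result"
```

The first number is the count with the line break left in, the second the count on the joined copy; only the second sees both occurrences.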
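Equivalently, using bioawk, you can do the following. With -c fastx, bioawk parses each record itself, so $name and $seq refer to the record name and the full sequence; note that gsub must be given the field $seq, not the bare variable seq:
bioawk -c fastx -v subseq="CGTACG" '(gsub(subseq,subseq,$seq)>1){print ">"$name; print $seq}' file.fasta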
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure whether this version is compatible with POSIX.