Note that in general, sequences in fasta format as shown in the question may span multiple lines (= they are often wrapped to 80 or 100 nucleotides per line). This answer handles such cases correctly as well, unlike some other answers in this thread.
Use these two Perl one-liners connected by a pipe. The first one-liner does all of the common reformatting of the fasta sequences that is necessary in this and similar cases. It removes newlines and whitespace in the sequence (which also unwraps the sequence), but does not change the sequence header lines. It also properly handles leading and trailing whitespace/newlines in the file. The second one-liner actually removes everything up to and including the last GGTT
in the sequence, in a case-insensitive manner.
Note: If GGTT
is at the end of the sequence, the output will be a header plus an empty sequence. See seq4 in the example below. This may cause issues with some bioinformatics tools used downstream.
# Create the input for testing:
cat > in.fa <<EOF
>seq1 with blanks
ACGT GGTT ACGT
>seq2 with newlines
ACGT
GGTT
ACGT
>seq3 without blanks or newlines
ACGTGGTTACGT
>seq4 everything should be deleted, with empty sequence in the output
ACGTGGTTACGTGGTT
>seq5 lowercase
acgtggttacgt
EOF
# Reformat to single-line fasta, then delete subsequences:
perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' in.fa | \
perl -pe 'next if /^>/; s/.*GGTT//i;' > out.fa
Output in file out.fa
:
>seq1 with blanks
ACGT
>seq2 with newlines
ACGT
>seq3 without blanks or newlines
ACGT
>seq4 everything should be deleted, with empty sequence in the output
>seq5 lowercase
acgt
The Perl one-linera use these command line flags:
-e
: Tells Perl to look for code in-line, instead of in a file.
-n
: Loop over the input one line at a time, assigning it to $_
by default.
-p
: Loop over the input one line at a time, assigning it to $_
by default. Add print $_
after each loop iteration.
chomp
: Remove the input line separator (\n
on *NIX).
if ( /^>/ )
: Test if the current line is a sequence header line.
$n
: This variable is undefined (false) at the beginning, and true after seeing the first sequence header, in which case we print an extra newline. This newline goes at the end of each sequence, starting from the first sequence.
END { print "\n"; }
: Print the final newline after the last sequence.
s/\s+//g; print;
: If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.
next if /^>/;
: Skip the header lines.
s/.*GGTT//i;
: Replace everything (.*
) up to and including the the last GGTT
with nothing (= delete it). The /i
modifier means case-insensitive match.
SEE ALSO:
perldoc perlrun
: how to execute the Perl interpreter: command line switches
perldoc perlre
: Perl regular expressions (regexes)
perldoc perlre
: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
Remove line breaks in a FASTA file