Remove pattern and everything before using AWK in fasta file

Question

I searched a lot but could not find a solution to my issue. I have a file that looks like:

>HEADER1
AACTGGTTACGTGGTTCTCT
>HEADER2
GGTTTCTC
>HEADER3
CCAGGTTTCGAGGGGTTACGGGGTA

I want to remove GGTT pattern and everything before it. So basically there are several of these patterns in some of the lines so I want to remove all of them including everything before the pattern or among them.

The desired output should look like:

>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA

I tried suggested ways but could not be able to adjust it to my data.

Thank you in Advance for your help.

jas · Accepted Answer · 2020-09-29T12:03:21.923

4

If it's not possible for your headers to include GGTT, I suppose the easiest would be:

$ sed 's/.*GGTT//' file
>HEADER1
CTCT
>HEADE2
TCTC
>HEADER3
ACGGGGTA

If your headers might contain GGTT, then awk probably be better:

$ awk '!/^>/ {sub(/.*GGTT/, "")}1' file
>HEADER1
CTCT
>HEADE2
TCTC
>HEADER3
ACGGGGTA

In both cases, the .*GGTT is "greedy", so it doesn't matter if there are multiple instances of GGTT, it will always match up to and remove everything through the last occurrence.

In the awk version, the pattern !/^>/ makes sure that subsitution is only done on lines that do not start with >.

edited Sep 29 '20 at 12:03

answered Sep 29 '20 at 11:55

jas

10,715
2
30
41

2

`sed` supports a wide range of filters too: `sed '/^>/! s/.*GGTT//'` .. also, headers wouldn't match, if my basic understanding of fasta files is correct – Sundeep Sep 29 '20 at 12:02
@jas, Thank you for your answer. I have another question, it would be great if you help me with it. Why we did not use ```sed``` in these versions like: ```sed 's///g'``` or ```sed 's///d'``` ? I read about ```d``` or ```g``` but did not completely get what are their advantages. I have seen sed in this format as well ```sed 's//d'```. – Apex Sep 29 '20 at 12:07
2

@Apex `s//d` should be `//d` (which means delete lines matching the regex, note that only regex is supplied here, no replacement string)... `g` flag indicates replace all occurrences, but in this solution there's only one match, so `g` is not needed (though it will still work with `g`)... see https://stackoverflow.com/tags/sed/info for learning resources – Sundeep Sep 29 '20 at 12:18
@Sundeep, Thank you for the detailed hints. I could not completely get your meaning about saying that ```g``` is not required in my example because there is only one match, but I have several ```GGTT``` in some of the lines. I know I am missing something here :) – Apex Sep 29 '20 at 12:23
1

@Apex that is mentioned in the answer: `In both cases, the .*GGTT is "greedy", so it doesn't matter if there are multiple instances of GGTT, it will always match up to and remove everything through the last occurrence.` - if that still doesn't clear up, see [my tutorial on quantifiers](https://learnbyexample.github.io/learn_gnused/breere-regular-expressions.html#quantifiers) as well as the "Longest match wins" section that follows – Sundeep Sep 29 '20 at 12:57
Note that `sed 's/^[^>]*GGTT//' file` will exclude header lines from being matched and still consumes any number of `GGTT` matches and in between strings. – potong Sep 29 '20 at 15:04
This does not handle multiline (wrapped) fasta files, which are often found in bioinformatics. – Timur Shtatland Sep 29 '20 at 15:08

score 2 · Answer 2 · answered Sep 29 '20 at 15:05

Note that in general, sequences in fasta format as shown in the question may span multiple lines (= they are often wrapped to 80 or 100 nucleotides per line). This answer handles such cases correctly as well, unlike some other answers in this thread.

Use these two Perl one-liners connected by a pipe. The first one-liner does all of the common reformatting of the fasta sequences that is necessary in this and similar cases. It removes newlines and whitespace in the sequence (which also unwraps the sequence), but does not change the sequence header lines. It also properly handles leading and trailing whitespace/newlines in the file. The second one-liner actually removes everything up to and including the last GGTT in the sequence, in a case-insensitive manner.

Note: If GGTT is at the end of the sequence, the output will be a header plus an empty sequence. See seq4 in the example below. This may cause issues with some bioinformatics tools used downstream.

# Create the input for testing:

cat > in.fa <<EOF

>seq1 with blanks
ACGT GGTT ACGT
>seq2 with newlines
ACGT

GGTT

ACGT

>seq3 without blanks or newlines
ACGTGGTTACGT

>seq4 everything should be deleted, with empty sequence in the output
ACGTGGTTACGTGGTT

>seq5 lowercase
acgtggttacgt

EOF

# Reformat to single-line fasta, then delete subsequences:

perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' in.fa | \
  perl -pe 'next if /^>/; s/.*GGTT//i;' > out.fa

Output in file out.fa:

>seq1 with blanks
ACGT
>seq2 with newlines
ACGT
>seq3 without blanks or newlines
ACGT
>seq4 everything should be deleted, with empty sequence in the output

>seq5 lowercase
acgt

The Perl one-linera use these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.

chomp : Remove the input line separator (\n on *NIX).
if ( /^>/ ) : Test if the current line is a sequence header line.
$n : This variable is undefined (false) at the beginning, and true after seeing the first sequence header, in which case we print an extra newline. This newline goes at the end of each sequence, starting from the first sequence.
END { print "\n"; } : Print the final newline after the last sequence.
s/\s+//g; print; : If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.

next if /^>/; : Skip the header lines. s/.*GGTT//i; : Replace everything (.*) up to and including the the last GGTT with nothing (= delete it). The /i modifier means case-insensitive match.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups

Remove line breaks in a FASTA file

Remove pattern and everything before using AWK in fasta file

2 Answers2