bash script - Use patterns list in sed to remove substrings

Question

I have this file (adapters.txt) with a list of patterns:

cactctttccctacacgacgctcttccg
cactctttccctacacgacgctcttccgaatcta
cactctttccctacacgacgctcttccgaatctaatt
cactctttccctacacgacgctcttccgaatctaatta
cactctttccctacacgacgctcttccgaatctag
cactctttccctacacgacgctcttccgaatctagc
cactctttccctacacgacgctcttccgacctcattcc
cactctttccctacacgacgctcttccgacctcattcccaccctcttccg
cactctttccctacacgacgctcttccgatc
cactctttccctacacgacgctcttccgatccaatt
cactctttccctacacgacgctcttccgatttagc
cactctttccctacacgacgctcttccgatttagct
cactctttccctacacgacgctcttccgatttcattc
cactctttccctacacgacgctcttccgatttcattcttcccc
cactctttccctacacgacgctcttccgattttatttc
cactctttccctacacgacgctcttccggatcta
cactctttccctacacgacgctcttccggatctaatt
cactctttccctacacgacgctcttccggatctaattc
cactctttccctacacgacgctcttccggatctaattca
cactctttccctacacgacgctcttccggatctagctt
cactctttccctacacgacgctcttccggttcta
cactctttccctacacgacgctttccgatcta
cactctttccctacacgacgctttccgatctaattc
cactctttccctacacgacgtcttccgatctaattctggaccatagtgcaatgt
cactctttccctacacgcgctcttccgatcta
cactctttccctacacgcgctcttccgatctaattcg
cactctttccctacacgcgctcttccgatctaattcgg
cactctttccctacacgcgctcttccgatctaattcggcgg
cactctttccctacacgcgctcttccgatctagct
cactctttccctaccgacgctcttccgatcta
cactctttccctacacgacg

I need find and remove these patterns from "sequences.fasta" file:

>seq01
cactctttccctacacgacgctcttccgWANTEDSEQUENCE
>seq01
cactctttccctacacgacgctcttccgaatctaWANTEDSEQUENCE
>seq03
cactctttccctacacgacgctcttccgaatctaattWANTEDSEQUENCE
>seq04
cactctttccctacacgacgctcttccgaatctaattaWANTEDSEQUENCE
>seq05
cactctttccctacacgcgctcttccgatctaattcggWANTEDSEQUENCE
>seq06
cactctttccctacacgcgctcttccgatctaattcggcggWANTEDSEQUENCE
>seq07
cactctttccctacacgcgctcttccgatctagctWANTEDSEQUENCE
>seq08
cactctttccctaccgacgctcttccgatctaWANTEDSEQUENCE

So the wanted output should be:

>seq01
WANTEDSEQUENCE
>seq02
WANTEDSEQUENCE
>seq03
WANTEDSEQUENCE
>seq04
WANTEDSEQUENCE
>seq05
WANTEDSEQUENCE
>seq06
WANTEDSEQUENCE
>seq07
WANTEDSEQUENCE
>seq08
WANTEDSEQUENCE

(Just for the sake of the example I've used "WANTEDSEQUENCE" instead of the real sequences)

I've tried the following (and some variations. I've also tried a while read):

ADAPS=($(cat adapters.txt))
FASTA="sequences.fasta"


for ADAP in "${ADAPS[@]}";
do
    sed "s/${ADAP}//g" "${FASTA}" > output.fasta
done

But I got this:

>seq01
ctcttccgWANTEDSEQUENCE
>seq01
ctcttccgaatctaWANTEDSEQUENCE
>seq03
ctcttccgaatctaattWANTEDSEQUENCE
>seq04
ctcttccgaatctaattaWANTEDSEQUENCE
>seq05
cactctttccctacacgcgctcttccgatctaattcggWANTEDSEQUENCE
>seq06
cactctttccctacacgcgctcttccgatctaattcggcggWANTEDSEQUENCE
>seq07
cactctttccctacacgcgctcttccgatctagctWANTEDSEQUENCE
>seq08
cactctttccctaccgacgctcttccgatctaWANTEDSEQUENCE

How can I solve this?

Sure, the only issue is that I'm not that familiar with all awk syntaxes. — Tiago Minuzzi, Jul 22 '20 at 23:21
Each iteration of the loop is operating on the original sequences.fasta file, not the output of the previous iteration. — Barmar, Jul 22 '20 at 23:40
So the final output file is just the last `$ADAP` replacement. — Barmar, Jul 22 '20 at 23:40
A one-liner: `sed -f <(sort -r adapters.txt | sed 's/.*/s|&||/') sequences.fasta` — M. Nejat Aydin, Jul 23 '20 at 05:07

Cyrus · Accepted Answer · 2020-07-23T00:07:10.197

3

Sort adapters.txt in reverse order by its line length, create a sed script from its output and use it with bash's command substitution <(...) with a second sed to apply it to sequences.fasta:

sed -f <(awk '{ print length, $0 }' adapters.txt | sort -rn | cut -d" " -f2- | sed -E 's/(.*)/s|&||/') sequences.fasta

Output:

>seq01
WANTEDSEQUENCE
>seq01
WANTEDSEQUENCE
>seq03
WANTEDSEQUENCE
>seq04
WANTEDSEQUENCE
>seq05
WANTEDSEQUENCE
>seq06
WANTEDSEQUENCE
>seq07
WANTEDSEQUENCE
>seq08
WANTEDSEQUENCE

The sorting of adapters.txt is necessary because it contains substrings from other strings in the same file.

Same code in multiple lines and files:

awk '{ print length, $0 }' adapters.txt | sort -rn | cut -d" " -f2- > adapters_sorted.txt
sed -E 's/(.*)/s|&||/' adapters_sorted.txt > sed.script
sed -f sed.script sequences.fasta

edited Jul 23 '20 at 00:07

answered Jul 22 '20 at 23:36

Cyrus

84,225
14
89
153

I'm not sure I understood this part `s|&||`. Do you mind explaining it? – Tiago Minuzzi Jul 23 '20 at 00:01
1

`s/(.*)/s|&||/'`: `&` is replaced by the part matching the regular expression `.*`. So here the whole line. Because I nested two `s` commands in each other I changed the syntax of the second one from `s///` to `s|||`. So `abc` becomes `s|abc||`. – Cyrus Jul 23 '20 at 00:12
2

Could be simplified as `sed -f <(sort -r adapters.txt | sed 's/.*/s|&||/') sequences.fasta`. Sorting by the line length is not necessary. A simple alphabetical reverse sort is enough for this kind of input. Thus the larger one comes first if any two strings have a common prefix. – M. Nejat Aydin Jul 23 '20 at 04:55

score 1 · Answer 2 · answered Jul 23 '20 at 04:53

1

With GNU ed and bash.

#!/usr/bin/env bash

ed -s sequences.fasta < <(
  printf '%s\n' '1,$-1s/$/\\|/' '1,$j' 's/^/,s\//' 's/$/\/\//' '$a' ,p Q . ,p Q |
  ed -s adapters.txt
)

answered Jul 23 '20 at 04:53

Jetchisel

7,493
2
19
18

score 0 · Answer 3 · answered Jul 23 '20 at 01:40

Here is a POSIX awk solution:

$ awk 'NR==FNR{seq[FNR]=$0; x=FNR; next}
      {for(i=1; i<=x; i++) if ($0 ~ "^" seq[i]) {sub(seq[i],""); print $0; next}
      print}
      ' <(awk '{ print length()"\t"$0}' adapters.txt | sort -nr | cut -f2)  sequences.fasta
>seq01
WANTEDSEQUENCE
>seq01
WANTEDSEQUENCE
>seq03
WANTEDSEQUENCE
>seq04
WANTEDSEQUENCE
>seq05
WANTEDSEQUENCE
>seq06
WANTEDSEQUENCE
>seq07
WANTEDSEQUENCE
>seq08
WANTEDSEQUENCE

Or gawk where you can sort the sequences from long to short internally:

$ gawk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_desc"}
      NR==FNR { seq[$0] = length($0); next }
      {for (e in seq)  if($0~"^" e) {sub(e,""); print $0; next}
      print}
      ' adapters.txt sequences.fasta
  # same output

Ed Morton · Answer 4 · 2020-07-23T04:51:51.600

With GNU awk for sorted_in:

$ cat tst.awk
NR==FNR {
    adapters2lengths[$1] = length($1)
    next
}
!/^>/ {
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (adapter in adapters2lengths) {
        if ( index($0,adapter) == 1 ) {
            $0 = substr($0,adapters2lengths[adapter]+1)
            break
        }
    }
}
{ print }

.

$ awk -f tst.awk adapters.txt sequences.fasta
>seq01
WANTEDSEQUENCE
>seq01
WANTEDSEQUENCE
>seq03
WANTEDSEQUENCE
>seq04
WANTEDSEQUENCE
>seq05
WANTEDSEQUENCE
>seq06
WANTEDSEQUENCE
>seq07
WANTEDSEQUENCE
>seq08
WANTEDSEQUENCE

The functional difference between this and @dawg's gawk solution is that this one only does string comparisons while theirs does regexp comparisons - it'll only matter if your "adapters.txt" file ever contains regexp metacharacters, all else being equal I just prefer using strings unless I need regexps.

bash script - Use patterns list in sed to remove substrings

4 Answers4