
How can I delete lines which are substrings of other lines in a file while keeping the longer strings which include them?

I have a file that contains peptide sequences as strings, one sequence per line. I want to keep the strings that contain the other sequences and remove every line that is a substring of another line in the file.

Input:

GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG

Expected Output:

GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN

The output should keep only the longest strings and remove all lines which are substrings of them. So, in the input above, lines 1, 2, 4 and 5 are substrings of line 3, so line 3 is retained in the output. Similarly, the strings on lines 6, 8, 9 and 10 are all substrings of line 7, so line 7 is retained and written to the output.

G. Cito
  • What's "longer"? "The two longest"? – Benjamin W. Feb 23 '16 at 00:41
  • Ah, do you want to remove any string that is a substring of another string? And what have you tried so far? – Benjamin W. Feb 23 '16 at 00:43
  • https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html#index-index_0028_0029-function – user3159253 Feb 23 '16 at 01:11
  • yes, i meant two longest strings. – Empyrean rocks Feb 23 '16 at 01:20
  • I think I know now what you ask, but the question is very unclear (and has 3 close votes for being unclear at the moment). You should specify more clearly which strings exactly should be kept (the ones that are not substrings of any other strings...?), what the input can look like (is the longest string always first, as in the example?), and what you have tried so far. You can [edit] your question. – Benjamin W. Feb 23 '16 at 01:23
  • 1
    sorry for the confusion. i just edited to make it clear. hope its clear now. Also, as these are peptide sequences, i convert in to fasta file and use CD-HIT program which clusters similar sequences with 100% identity and produces output. later, convert that fasta to text file for further analysis. – Empyrean rocks Feb 23 '16 at 01:32
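The FASTA conversion mentioned in the last comment can be done in a single awk pass. This is a sketch; the `>seqN` header naming and the inlined sample sequences are illustrative, not from the comment:

```shell
# Hypothetical helper: turn one-sequence-per-line text into FASTA by
# numbering each record with a ">seqN" header (naming is made up here)
fasta=$(printf '%s\n' GSAAQQYW ATFYGGSDASGT |
        awk '{ printf(">seq%d\n%s\n", NR, $0) }')
printf '%s\n' "$fasta"
```

In practice the `printf '%s\n' …` sample input would be replaced by the real sequence file.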

4 Answers


Maybe:

input=./input_file
while read -r str
do
    # a string that matches only one line (its own) is not a substring
    # of any other line; -F treats the string as fixed text, not a regex
    [[ $(grep -cF -- "$str" "$input") == 1 ]] && echo "$str"
done < "$input"

produces:

GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN

It is slow, but simple.
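For larger files, one way to cut down the work (a sketch, not part of the answer above; the three sample sequences are inlined for illustration) is to sort longest-first and test each line only against the lines already kept:

```shell
#!/bin/bash
# Sketch: process lines longest-first, keeping a line only if it is not
# a substring of a line that was already kept.
input=$'GSAAQQYW\nGSAAQQYWTPANATFYGGSDASGT\nATFYGGSDASGT'
kept=""
while IFS= read -r str; do
    if [[ $kept != *"$str"* ]]; then   # not contained in anything kept yet
        kept+="$str"$'\n'
    fi
done < <(awk '{ print length($0), $0 }' <<<"$input" | sort -rn | cut -d' ' -f2-)
printf '%s' "$kept"
```

Since the kept lines are separated by newlines and the input strings contain none, a match can never span two kept entries.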

clt60

This should do what you want:

$ cat tst.awk
{ arr[$0]; strs=strs $0 RS }
END {
    for (str in arr) {
        if ( split(strs,tmp,str) == 2 ) {
            print str
        }
    }
}

$ awk -f tst.awk file
IVPVNYARTTCRRTGGIRFTITGHDYFDN
GSAAQQYWTPANATFYGGSDASGT

It loops through every string in arr and uses each one as the separator value for split(). If the string occurs only once in the file, the full file contents are split in half and split() returns 2; but if the string is a substring of some other line, the file contents are split into more segments and split() returns a number higher than 2.
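The split() return value can be checked directly from the shell (a standalone sketch):

```shell
# split() returns the number of resulting pieces: a separator that occurs
# exactly once yields 2 pieces; two occurrences yield 3
once=$(awk 'BEGIN { print split("headABCtail", p, "ABC") }')
twice=$(awk 'BEGIN { print split("xABCyABCz", p, "ABC") }')
echo "$once $twice"
```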

If a string can appear multiple times in the input and you want it printed multiple times in the output (see the question in the comment from @G.Cito below) then you'd modify the above to:

!cnt[$0]++ { strs=strs $0 RS }
END {
    for (str in cnt) {
        if ( split(strs,tmp,str) == 2 ) {
            for (i=1;i<=cnt[str];i++) {
                print str
            }
        }
    }
}
Ed Morton
  • 1
    I added an explanation at the bottom – Ed Morton Feb 24 '16 at 00:05
  • @EdMorton so of all the strings in `file`, `awk` finds and prints only those that can be divided in two (skipping those that can't be divided or divide more than once). ++ nice and simple! Is there a simple way to handle the case where the "long" string (what I call the "master string" in the more baroque perl solution below) occurs more than once? With your `awk` script and the perl `%uniq` hash it would get left out of the output. – G. Cito Feb 24 '16 at 16:51
  • If the requirement is to only print it once then you just need to change the first line to `!arr[$0]++{ strs=strs $0 RS }` (idiomatically `arr` would be named `seen` or `count` when used in that context) so it only appears once in the strs string that gets split later. If the requirement is to print it as many times as it appeared in the input then you'd additionally change the `print str` to `for (i=1;i<=arr[str];i++) print str`. I updated my answer to show that. – Ed Morton Feb 24 '16 at 17:00
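The duplicate-handling variant can be checked on a small synthetic input (the AB/ABCD/CD strings are made up for this sketch):

```shell
# ABCD is the longest string and appears twice in the input, so it is
# printed twice; AB and CD are substrings of it and are dropped
out=$(printf '%s\n' AB ABCD ABCD CD | awk '
    !cnt[$0]++ { strs = strs $0 RS }
    END {
        for (str in cnt)
            if (split(strs, tmp, str) == 2)
                for (i = 1; i <= cnt[str]; i++)
                    print str
    }')
echo "$out"
```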

As a perl "one liner" (this should work for cutting and pasting into a terminal):

perl -E 'chomp(@r=<>); 
        for $i (0..$#r){ 
           map { $uniq{$_}++ if ( index( $r[$i], $_ ) != -1 ) } @r; 
        }
        for (sort keys %uniq){ say if ( $uniq{$_} == 1 ); }' peptide_seq.txt
  • We read and chomp the file (peptide_seq.txt) via the <> operator and save it in @r, an array in which each element is one line of the file.

  • Next we iterate through the array and map the elements of @r to a hash (%uniq) where each key is the content of each line; and each value is a number that is incremented when a line is found to be a substring of another line. Using index we can check whether a string contains a sub-string and increment the corresponding hash value if index() does not return the value for "not found" (-1).

  • The "master" strings contain all the other strings as sub-strings of themselves and will only be incremented once, so we loop again to print the keys of the %uniq hash that have the value == 1. This second loop could be a map instead:

    map { say if ( $uniq{$_} == 1 ) } sort keys %uniq;

As a self-contained script that could be:

#!perl -l
chomp(@r=<DATA>); 

for $i (0..$#r) {
  map { $uniq{$_}++ if ( index( $r[$i], $_ ) != -1 ) } @r ;
}

map { print if ($uniq{$_} == 1) } sort keys %uniq ; 

__DATA__
GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG

Output:

GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN
G. Cito

This should give you what you need:

awk '{ print length(), NR, $0 | "sort -rn" }' sed_longer.txt | head -n 2
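Note that this prints each line prefixed with its length and line number, and it keeps exactly the two longest lines, which matches the expected output only because this particular input happens to have two "master" strings. A standalone sketch of what the pipeline does (sample strings are made up):

```shell
# Prefix each line with its length and line number, sort numerically by
# length (descending), and keep the top two lines
out=$(printf '%s\n' AAAA BB CCCCCC |
      awk '{ print length(), NR, $0 }' | sort -rn | head -n 2)
echo "$out"
```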

Sana