How can I delete lines which are substrings of other lines in a file while keeping the longer strings which include them?
I have a file that contains peptide sequences as strings, one sequence per line. I want to keep only the strings that are not contained in any other line, and remove every line that is a substring of another line in the file.
Input:
GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG
Expected Output:
GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN
The output should keep only the longest strings and remove all lines that are substrings of them. In the input above, lines 1, 2, 4 and 5 are substrings of line 3, so the output retains line 3. Similarly, the strings on lines 6, 8, 9 and 10 are all substrings of line 7, so line 7 is retained and written to the output.
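To make the requirement concrete, here is a minimal Python sketch of one way the filtering could be done. It is an illustration, not a tested solution for large files: the function name `drop_contained` is made up for this example, and it assumes that exact duplicate lines (such as lines 2 and 5 above) should collapse to a single entry and that blank lines can be ignored. It compares each string against the strings already kept, so the cost grows roughly quadratically with the number of lines.

```python
def drop_contained(lines):
    """Return the lines that are not substrings of any other line.

    Hypothetical helper for illustration. Duplicates are collapsed and
    blank lines are skipped (both are assumptions about the desired output).
    """
    # Strip whitespace, skip blanks, and deduplicate while keeping
    # first-seen order (dicts preserve insertion order).
    unique = list(dict.fromkeys(s.strip() for s in lines if s.strip()))

    # Visit the longest strings first: a string can only be contained in
    # one that is at least as long, so every possible container of s has
    # already been examined by the time s is reached.
    kept = []
    for s in sorted(unique, key=len, reverse=True):
        if not any(s in k for k in kept):
            kept.append(s)

    # Report the survivors in their original file order.
    kept_set = set(kept)
    return [s for s in unique if s in kept_set]


sample = [
    "GSAAQQYW",
    "ATFYGGSDASGT",
    "GSAAQQYWTPANATFYGGSDASGT",
    "GSAAQQYWTPANATF",
    "ATFYGGSDASGT",
    "NYARTTCRRTG",
    "IVPVNYARTTCRRTGGIRFTITGHDYFDN",
    "RFTITGHDYFDN",
    "IVPVNYARTTCRRTG",
    "ARTTCRRTGGIRFTITG",
]

for seq in drop_contained(sample):
    print(seq)
```

Checking only against the kept strings is enough, because any string that was dropped is itself contained in a kept one. To run this on a real file, the list could be replaced by the lines read from that file, for example `drop_contained(open(path))` with `path` standing in for the actual file name.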