Is there a way to remove both duplicates and redundant substrings from a list, using shell tools? By "redundant", I mean a string that is contained within another string, so "foo" is redundant with "foobar" and "barfoo". For example, take this list:
abcd
abc
abd
abcd
bcd
and return:
abcd
abd
uniq
, sort -u
and awk '!seen[$0]++'
remove duplicates effectively but not redundant strings:
How to delete duplicate lines in a file without sorting it in Unix?
Remove duplicate lines without sorting
I can loop through each line recursively with grep
but this is is quite slow for large files. (I have about 10^8 lines to process.)
There's an approach using a loop in Python here: Remove redundant strings based on partial strings and Bash here: How to check if a string contains a substring in Bash but I'm trying to avoid loops. Edit: I mean nested loops here, thanks for the clarification @shellter
Is there a way to use a awk's match()
function with an array index? This approach builds the array progressively so never has to search the whole file, so should be faster for large files. Or am I missing some other simple solution?
An ideal solution would allow matching of a specified column, as for the methods above.
EDIT
Both of the answers below work, thanks very much for the help. Currently testing for performance on a real dataset, will update with results and accept an answer. I tested both approaches on the same input file, which has 430,000 lines, of which 417,000 are non-redundant. For reference, my original looped grep approach took 7h30m with this file.
Update:
James Brown's original solution took 3h15m and Ed Morton's took 8h59m. On a smaller dataset, James's updated version was 7m versus the original's 20m. Thank you both, this is really helpful.
The data I'm working with are around 110 characters per string, with typically hundreds of thousands of lines per file. The way in which these strings (which are antibody protein sequences) are created can lead to characters from one or both ends of the string getting lost. Hence, "bcd" is likely to be a fragment of "abcde".