5

Is there a way to remove both duplicates and redundant substrings from a list, using shell tools? By "redundant", I mean a string that is contained within another string, so "foo" is redundant with "foobar" and "barfoo". For example, take this list:

abcd
abc
abd
abcd
bcd

and return:

abcd
abd

`uniq`, `sort -u` and `awk '!seen[$0]++'` remove duplicates effectively, but not redundant strings (see How to delete duplicate lines in a file without sorting it in Unix? and Remove duplicate lines without sorting).
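
For example, on the sample list above, plain deduplication keeps the redundant substrings:

$ awk '!seen[$0]++' file
abcd
abc
abd
bcd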

I can loop through each line with grep, but this is quite slow for large files. (I have about 10^8 lines to process.) There's an approach using a loop in Python here (Remove redundant strings based on partial strings) and in Bash here (How to check if a string contains a substring in Bash), but I'm trying to avoid loops. Edit: I mean nested loops here; thanks for the clarification, @shellter.
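
Roughly, the kind of looped grep approach I mean looks like this (a simplified sketch on whole lines, not my exact script): keep a line only if no other line contains it, then drop exact duplicates.

while IFS= read -r s; do
    # count lines that contain $s but are not exactly $s
    if [ "$(grep -F -- "$s" file | grep -cvxF -- "$s")" -eq 0 ]; then
        printf '%s\n' "$s"
    fi
done < file | awk '!seen[$0]++'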

Is there a way to use awk's match() function with an array index? That approach would build the array progressively, so it never has to search the whole file and should be faster for large files. Or am I missing some other simple solution?

An ideal solution would allow matching of a specified column, as for the methods above.

EDIT

Both of the answers below work, thanks very much for the help. Currently testing for performance on a real dataset, will update with results and accept an answer. I tested both approaches on the same input file, which has 430,000 lines, of which 417,000 are non-redundant. For reference, my original looped grep approach took 7h30m with this file.
Update:
James Brown's original solution took 3h15m and Ed Morton's took 8h59m. On a smaller dataset, James's updated version was 7m versus the original's 20m. Thank you both, this is really helpful.

The data I'm working with are around 110 characters per string, with typically hundreds of thousands of lines per file. The way in which these strings (which are antibody protein sequences) are created can lead to characters from one or both ends of the string getting lost. Hence, "bcd" is likely to be a fragment of "abcde".

  • 1
    Define "redundant strings" in clear terms please. – Paul Hodges Oct 08 '20 at 17:37
  • @KamilCuk No, sorry for the confusion. Edited to reflect the actual file structure. – garethmorgan Oct 08 '20 at 17:37
  • so should `ab` be considered to be redundant to `abcd` ? What about just `a`... hmm , or just `b` or that matter? Also, define looping. any script that reads thru a complete file is a loop, and if you need to go back to dbl-check, that will be another loop (I wouldn't worry about that, get your problem clearly defined and solved, then you can worry about looping ;-) ). Good luck! – shellter Oct 08 '20 at 17:50
  • How long are the real world strings in the file? – James Brown Oct 08 '20 at 20:36

3 Answers

5
$ awk '{print length($0), $0}' file |
    sort -k1,1rn -k2 -u |
    awk '!index(str,$2){str = str FS $2; print $2}'
abcd
abd

The above assumes the set of unique values will fit in memory.
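
The question also asks about matching on a specified column; a minimal sketch of the same idea keyed on one column (column 2 here, passed as `c`), assuming whitespace-separated fields and that the column values contain no whitespace:

$ awk -v c=2 '{print length($c), $0}' file |
    sort -k1,1rn |
    awk -v c=2 '!index(str, $(c+1)) {str = str FS $(c+1); sub(/^[0-9]+ /, ""); print}'

Exact duplicates are dropped too, since an identical value is always found by index() in str; as above, the retained strings must fit in memory.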

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • 1
    That string search is a beautiful optimization. Have done it before, but was trying to figure out something that didn't store the whole dataset. I guess you gotta. – Paul Hodges Oct 08 '20 at 19:55
  • 2
    Sorting like this first seems mandatory. I did some testing on a small dictionary. `1)` The bottleneck in this task is the last part, where I see this approach with `index` is several times faster than doing something like `'{for (i in a) if (i ~ $2) next} {a[$2]; print $2}'`. `2)` Some (small) performance improvement can be gained by removing duplicates in the first awk with `!a[$0]++{...}`, as sort will then be faster as `sort -rn` only (no splitting), unless you have other reasons for this sort. – thanasisp Oct 08 '20 at 19:55
  • 1
    @PaulHodges it doesn't store the whole data set, just the strings that don't exist as substrings of previously read strings. So if you read a string like `abcd` it'll be stored as part of `str` but then future strings like `ab`, `abc`, `bcd` from the input won't be stored because they don't need to be. – Ed Morton Oct 08 '20 at 21:54
  • 1
    @thanasisp I'm not sure if doing the uniqueness test in awk would really be any faster but it could be, idk. It will use more memory though as then **every** unique string from the input will need to be stored in memory even if it's a substring of some other string. `sort -rn` would be fine if we didn't have to add `-u` to remove dups. – Ed Morton Oct 08 '20 at 22:03
  • This method answers my question without a nested loop, but it seems to slow down significantly on large files. It builds the file incrementally without using much memory so maybe reading and writing to disk becomes limiting? – garethmorgan Oct 09 '20 at 14:50
  • 1
    That seems less likely than just `index()` gets slower the longer the string `str` gets. There are various ways to speed it up of course but they all involve a loop which you wanted to avoid. – Ed Morton Oct 09 '20 at 15:35
  • Thanks. I'd assumed that `index()` would suffer less than a nested loop as the length increased, but evidently not. Good to know. – garethmorgan Oct 09 '20 at 20:16
5

An awk that, on the first pass, extracts and stores all substrings and full strings into two arrays, subs and strs, and then checks against them on the second pass (the file is passed twice, hence `file file`):

$ awk '
NR==FNR {                                    # first run 
    if(($0 in strs)||($0 in subs))           # process only unseen strings
        next
    len=length()-1                           # initial substring length
    strs[$0]                                 # hash the complete strings
    while(len>=1) {                          
        for(i=1;i+len-1<=length();i++) {     # get all substrings of current len
            asub=substr($0,i,len)            # "sub" is a reserved name, hence asub
            if(asub in strs)                 # if substring is in strs
                delete strs[asub]            # we do not want it there
            subs[asub]                       # hash all substrings too
        }
        len--                                
    }
    next
}
($0 in strs)&&++strs[$0]==1' file file

Output:

abcd
abd

I tested the script with about 30 M records of 1-20 char ACGT strings. The script ran in 3m27s and used about 20% of my 16 GB of memory. Using strings of length 1-100 I OOM'd in a few minutes (tried again with about 400k records of length 50-100; that used about 200 GB and ran for about an hour). (20 M records of 1-30 chars ran in 7m10s and used 80% of the memory.)

So if your data records are short or you have unlimited memory, my solution is fast, but otherwise it's going to crash by running out of memory.
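
Test data of the kind described above can be generated with something along these lines (a sketch, not the exact generator I used; adjust n and the length range as needed):

$ awk -v n=30000000 'BEGIN {
    srand()
    for (r = 1; r <= n; r++) {
        len = 1 + int(rand() * 20)                        # random length 1-20
        s = ""
        for (i = 1; i <= len; i++)                        # append random bases
            s = s substr("ACGT", 1 + int(rand() * 4), 1)
        print s
    }
}' > file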

Edit:

Another version that tries to conserve memory. On the first pass it checks the min and max lengths of the strings, and on the second pass it won't store substrings shorter than the global min. For about 400k records of length 50-100 it used around 40 GB and ran in 7 minutes. My random data didn't have any redundancy, so input == output. It did remove redundancy with other datasets (2 M records of 1-20 char strings):

$ awk '
BEGIN {
    while((getline < ARGV[1])>0) {          # 1st run, check min and max lengths
        if(length()<min||min=="")           # TODO: test for length()>0, too
            min=length()
        else if(length()>max||max=="")
            max=length()
#       print min,max > "/dev/stderr"       # debug
    }
    close(ARGV[1])

    while((getline < ARGV[1])>0) {          # 2nd run, hash strings and substrings
#       if(++nr%10000==0)                   # debug
#           print nr > "/dev/stderr"        # debug
        if(($0 in strs)||($0 in subs))
            continue
        len=length()-1
        strs[$0]
        while(len>=min) {
            for(i=1;i+len-1<=length();i++) {
                asub=substr($0,i,len)
                if(asub in strs)
                    delete strs[asub]
                subs[asub]
            }
            len--
        }
    }
    close(ARGV[1])

    while((getline < ARGV[1])>0)             # 3rd run, output 
        if(($0 in strs)&&!strs[$0]++)
            print
}' file
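
To compare peak memory use of the two versions on your own data, GNU time can report it; for example (assuming the script is saved as redundant.awk, a hypothetical filename, and discarding the normal output just to measure):

$ /usr/bin/time -v awk -f redundant.awk file 2>&1 >/dev/null | grep 'Maximum resident'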
James Brown
  • 36,089
  • 7
  • 43
  • 59
  • I have a flu (I wish it's just a regular seasonal flu) so I hope that script is not a complete disaster... – James Brown Oct 08 '20 at 20:47
  • 1
    The script I posted was primarily to satisfy the OP's statement `I'm trying to avoid loops`, so I wasn't particularly shooting for fast, but I'm very surprised to hear it's slower than the script you posted given you have a nested loop in there. Do you have a script you could share in your question to generate the input you're running with? – Ed Morton Oct 08 '20 at 21:47
  • 1
    I had a random 30 MB string of ACGTs floating around and I cut it to pieces with: `awk '{i=1;while(i file` So basically no, I don't. :D (Gotta sleep now, tho. GN) – James Brown Oct 08 '20 at 21:54
  • works very well, it's fast. I think memory usage will not be that big for regular words (average 8-10 characters), without many large strings (100 characters), because the substrings will be repeated a lot. – thanasisp Oct 08 '20 at 22:38
2

EDIT


This won't work. Sorry.

@Ed's solution is the best idea I can imagine without some explicit looping, and even that is implicitly scanning over the near-entire growing history of data on every record. It has to.

Can your existing resources hold that whole column in memory, plus a delimiter per record? If not, then you're going to be stuck with either very complex optimization algorithms, or VERY slow redundant searches.

Original post left for reference in case it gives someone else an inspiration.


That's a lot of data.

Given the input file as-is,

while read -r next
do [[ "$last" == "$next" ]] && continue                    # throw out repeats
   [[ "$last" =~ $next   ]] && continue                    # throw out substrings
   [[ "$next" =~ $last   ]] && { last="$next"; continue; } # upgrade if last is a substring of next
   echo "$last"                # distinct string
   last="$next"                # set new key
done < file

yields

abcd
abd

With a file of that size I wouldn't trust that sort order, though. Sorting is going to be very slow and take a lot of resources, but will give you more trustworthy results. If you can sort the file once and use that output as the input file, great. If not, replace that last line with done < <( sort -u file ) or something to that effect.

Reworking this logic in awk will be faster.

$: sort -u file | awk '1==NR{last=$0} last~$0{next} $0~last{last=$0;next} {print last;last=$0}' 

Aside from the sort this uses trivial memory and should be very fast and efficient, for some value of "fast" on a file with 10^8 lines.

Paul Hodges
  • 13,382
  • 1
  • 17
  • 36
  • I THOUGHT that `sort | awk` script wouldn't discard `bc` as redundant if the previous lines were `abc` and `ab` despite `bc` being a substring of `abc` but when I tried running it on input generated by `printf 'abc\nab\bc\n'` I got no output at all which, though a failure in not outputting `abc`, wasn't the one I was expecting! – Ed Morton Oct 08 '20 at 18:15
  • `\b` is throwing you off. :) Your point is totally valid though. This just doesn't have the complexity to handle the problem. :( – Paul Hodges Oct 08 '20 at 19:41
  • 1
    Darn, I actually just created the file manually, I only put that printf in the comment to try to make it easier to explain how to create such a file and of course I screwed that up! I meant `printf 'abc\nab\nbc\n'` of course. Yeah, it's impossible to solve this with a script that only looks at the immediately previous line since the current line might be a substring of an earlier line than just the one before it. – Ed Morton Oct 08 '20 at 19:43