Print differences between not sorted strings from files

Question

I have two files that each contain n lines with one string per line. I want to print the difference in characters between corresponding lines of the two lists. You could think of the operation as a sort of "subtraction" of letters. This is what it should look like:

List1       List2      Result
AaBbCcDd    AaCcDd     Bb
AaBbCcE     AaBbCc     E
AaBbCcF     AaCcF      Bb

Note that the lines of the second list are not sorted alphabetically, but the elements within each string are (Aa comes before Bb, which comes before Cc). The elements to remove can be either 1 or 2 characters long (F or Aa): each starts with an uppercase letter that is sometimes followed by a lowercase letter. The strings are composed entirely of permutations of a few "elements" like Aa, Bb, Cc, Dd, E, F, Gg, and so on.

This question has been answered in very similar form here: Bash script Find difference between two strings, but only for two strings entered manually, whereas I need to do the operation many hundreds of times. I am struggling with implementing files as a source to this command while also separating the characters correctly. Here is my adaptation:

split_chars() { sed $'s/./&\\\n/g' <<< "$1"; }
comm -23 <(split_chars AaBbCcDd) <(split_chars AaCcDd)

which gives as output

B
b

so it is still not quite what I want, even in this single case. I guess that the split_chars function is the key here, but I was not able to apply it to my files in any way. Putting the file names inside the brackets obviously does not work. For reference, a simple

comm -23 List1 List2

just leads to

AaBbCcDd
AaBbCcEe
AaBbCcF
comm: file 2 is not in sorted order

Well, they aren't. The strings are composed of permutations of a few elements like ``Aa``, ``Bb``, ``Cc``, ``Dd``, ``E``, ``F``, ``Gg``, ... and so on. - And, Apr 17 '19 at 11:28
Only the order of the strings in the second list is unsorted; the elements within each string are sorted alphabetically, so ``AaBb`` exists while ``BbAa`` does not. - And, Apr 17 '19 at 11:39

Socowi · Accepted Answer · 2019-04-17T13:20:30.927

Since you want to split the strings not into single characters but into substrings that each start with an uppercase letter, you should replace split_chars with the following function.

split() { sed 's/[A-Z]/\n&/g' <<< "$1"; }

Splitting a line can be undone by deleting all newline characters using tr -d \\n.
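
As a quick check of these two steps, the following sketch splits a sample string and joins it again. It assumes GNU sed (which turns \n in the replacement into a newline) and feeds sed through printf and a pipe instead of the here-string, so that it also runs under a plain POSIX shell. join_lines is a hypothetical helper name, not part of the answer.

```shell
# Split a string into one element per line: a newline is inserted before
# every uppercase letter (assumes GNU sed for \n in the replacement).
split() { printf '%s\n' "$1" | sed 's/[A-Z]/\n&/g'; }

# Hypothetical helper: undo the split by deleting all newlines,
# then print one final newline.
join_lines() { tr -d '\n'; echo; }

split AaBbCcE               # prints an empty line, then Aa, Bb, Cc, E
split AaBbCcE | join_lines  # prints AaBbCcE
```

The output of split starts with an empty line because the string itself begins with an uppercase letter. Both lists get that same empty line, so it drops out again during the subtraction.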

To subtract a list of lines from another list of lines you can use grep without having to sort.

grep -vFxf subtrahend minuend

This will print in original order those lines from file minuend which are not in file subtrahend.
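
A small self-contained illustration of that grep call, using throwaway files in a temporary directory. The directory variable and the sample elements are assumptions made for the demonstration only.

```shell
# Scratch directory for the two demo files.
tmp=$(mktemp -d)

# The minuend is in sorted order; the subtrahend deliberately is not.
printf '%s\n' Aa Bb Cc Dd > "$tmp/minuend"
printf '%s\n' Dd Aa Cc    > "$tmp/subtrahend"

# -f FILE  read the patterns from FILE, one per line
# -F       treat the patterns as fixed strings, not regular expressions
# -x       a pattern must match the whole line
# -v       invert the match: print only the lines that match no pattern
result=$(grep -vFxf "$tmp/subtrahend" "$tmp/minuend")
echo "$result"   # prints Bb

rm -rf "$tmp"
```

Keep in mind that this behaves like a set difference: if a line occurred several times in minuend, every occurrence would be removed as soon as it appears once in subtrahend.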

To put everything together, you have to

read both files line by line in parallel
split each string into a list of lines
subtract those lists
undo the splitting

Here is a simplified version, assuming that your input files contain only lines of the described format and that both files have the same number of lines.

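# Put each element (uppercase letter plus optional lowercase letter) on its own line.
split() { sed 's/[A-Z]/\n&/g' <<< "$1"; }
# Print the lines of file $1 that do not occur in file $2.
subtract() { grep -vFxf "$2" "$1"; }
# Join the remaining lines back into a single string.
union() { tr -d \\n; echo; }
# Read the n-th line of both files together and subtract the second from the first.
paste List1 List2 | while read -r minuend subtrahend; do
    subtract <(split "$minuend") <(split "$subtrahend") | union
done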
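
To try the approach on the sample data from the question, something like the sketch below could be used. It is a variant, not the exact script above: the two process substitutions are swapped for a temporary pattern file and the here-string for printf, so that it also runs in shells without those bash features. It still assumes GNU sed. The names subtract_lists, pattern_file and demo_dir are hypothetical.

```shell
# Same splitting rule as above (assumes GNU sed for \n in the replacement).
split() { printf '%s\n' "$1" | sed 's/[A-Z]/\n&/g'; }

# Hypothetical wrapper: subtract_lists FILE1 FILE2
# For every line pair, prints the elements of FILE1 that are missing in FILE2.
subtract_lists() {
    pattern_file=$(mktemp) || return 1
    paste "$1" "$2" | while read -r minuend subtrahend; do
        # The split subtrahend becomes the pattern list for grep.
        split "$subtrahend" > "$pattern_file"
        # Drop the matching lines, then glue the rest back together.
        split "$minuend" | grep -vFxf "$pattern_file" | tr -d '\n'
        echo
    done
    rm -f "$pattern_file"
}

# Sample data from the question, written to a scratch directory.
demo_dir=$(mktemp -d)
printf '%s\n' AaBbCcDd AaBbCcE AaBbCcF > "$demo_dir/List1"
printf '%s\n' AaCcDd   AaBbCc  AaCcF   > "$demo_dir/List2"

subtract_lists "$demo_dir/List1" "$demo_dir/List2"   # prints Bb, E, Bb (one per line)
rm -rf "$demo_dir"
```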

Bash loops like this one are slow, because every iteration starts several external processes (sed, grep, tr). If you need a faster solution, you should rewrite this script in a language such as Perl or Python.
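
Staying within the shell, one way to avoid starting several processes per line is to do the whole subtraction in a single awk process. The following is a minimal, unbenchmarked sketch under the same assumptions as above (two files with the same number of lines, elements of the form uppercase letter plus optional lowercase letter). subtract_awk and the awk variable names are hypothetical.

```shell
# Hypothetical helper: subtract_awk FILE1 FILE2
# Uses only POSIX awk features (match, substr, RSTART, RLENGTH).
subtract_awk() {
    paste "$1" "$2" | awk '
    {
        # Collect the elements of the subtrahend (second column) in a set.
        split("", seen)                     # portable way to empty an array
        rest = $2
        while (match(rest, /[A-Z][a-z]?/)) {
            seen[substr(rest, RSTART, RLENGTH)] = 1
            rest = substr(rest, RSTART + RLENGTH)
        }
        # Keep the elements of the minuend (first column) that are not in the set.
        out = ""
        rest = $1
        while (match(rest, /[A-Z][a-z]?/)) {
            element = substr(rest, RSTART, RLENGTH)
            if (!(element in seen))
                out = out element
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }'
}
```

Called as subtract_awk List1 List2, this prints one result per line pair. Like the grep version, it removes every element that occurs anywhere in the second string.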

score 0 · Answer 2 · answered Apr 17 '19 at 13:43

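Another approach, in GNU awk:

$ gawk 'NR==FNR {           # first file: remember each line by its line number
    a[FNR]=$0
    next
}
{
    # second file: split both strings into elements (patsplit is gawk-specific)
    patsplit($0 a[FNR],b,/[A-Z][a-z]?/)
    printf "%s%s%s", a[FNR],OFS,$0
    # print every element that does not occur in both strings
    for(i in b)
        if(!(match($0,b[i])&&match(a[FNR],b[i])))
            printf "%s%s", OFS, b[i]
    print ""
}' file1 file2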

Output:

List1 List2
AaBbCcDd AaCcDd Bb
AaBbCcE AaBbCc E
AaBbCcF AaCcF Bb
