Assume we have collection of strings as input:
str1: ABCD
str2: ABCD
str3: AWXYD
str4: AWXYD
The goal is to remove duplicates and preserve the unique ones. Having the above input, our output should look like the following:
ABCD
AWXYD
Unfortunately, the machine that produces this collection is error prone and sometimes misses some alphabets (see below). Luckily, we are able detect that there is a missing part. But we don’t know how large this missing part is. In reality we have the following input:
str1: A?CD
str2: AB?D
str3: AWXYD
str4: A?D
where ?
indicates the missing part.
In this example, we would like to preserve A?CD
or AB?D
and also AWXYD
.
In an attempt to solve this, I substituted ?
with .*
and assumed these strings to be regular expressions:
Reg1 --> A.*CD
Reg2 --> AB.*D
Reg3 --> AWXYD
Reg4 --> A.*D
Now I am trying to identify the duplicates by comparing regular expressions. Using this approach, one can easily match Reg4
to Reg3
because Reg3
is actually a string (no missing part). Things become complicated when both have missing parts and therefore you have to compare regular expressions.
I wonder if it is possible or if there is a better solution for this.
Thanks!
Edit 1: Note that str1
and str2
might came from different strings (e.g. AXCD
and ABXD
). Our goal is to remove any (possible) duplicates and be sure that preserved strings are unique (even if we remove more). This is why we preserve either str1
or str2
. Thanks to Aaron who pointed this out.
Edit 2: There are millions of strings. That's why an algorithms is needed.