
Assume we have a collection of strings as input:

str1: ABCD
str2: ABCD
str3: AWXYD
str4: AWXYD

The goal is to remove duplicates and preserve only the unique ones. Given the above input, the output should look like the following:

ABCD
AWXYD

Unfortunately, the machine that produces this collection is error-prone and sometimes misses some letters (see below). Luckily, we are able to detect that there is a missing part, but we don't know how large the missing part is. In reality we have the following input:

str1: A?CD
str2: AB?D
str3: AWXYD
str4: A?D

where ? indicates the missing part.

In this example, we would like to preserve A?CD or AB?D and also AWXYD.

In an attempt to solve this, I substituted ? with .* and treated these strings as regular expressions:

Reg1 --> A.*CD
Reg2 --> AB.*D
Reg3 --> AWXYD
Reg4 --> A.*D

Now I am trying to identify duplicates by comparing the regular expressions. With this approach, Reg4 can easily be matched against Reg3, because Reg3 is actually a plain string (no missing part). Things become complicated when both sides have missing parts, because then two regular expressions have to be compared against each other.
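For what it's worth, here is a minimal Python sketch of that idea (the helper names `to_pattern` and `is_duplicate` are purely illustrative): each `?` placeholder becomes `.*`, and whenever one of the two strings has no missing part it is matched directly against the other string's pattern. The hard case, where both strings contain `?`, is left undecided here.

```python
import re

def to_pattern(s):
    # Turn e.g. "A?CD" into the regex "A.*CD": every '?' placeholder becomes
    # '.*', and the known parts are escaped literally.
    return ".*".join(re.escape(part) for part in s.split("?"))

def is_duplicate(a, b):
    # Easy case: one of the strings is complete (no missing part), so we can
    # simply test it against the other string's pattern.
    if "?" not in a:
        return re.fullmatch(to_pattern(b), a) is not None
    if "?" not in b:
        return re.fullmatch(to_pattern(a), b) is not None
    # Hard case: both strings have missing parts. Deciding whether the two
    # patterns overlap needs a regex-intersection test, which plain `re`
    # does not offer, so it is left undecided here.
    return None

print(is_duplicate("A?D", "AWXYD"))   # True  (Reg4 vs. Reg3)
print(is_duplicate("A?CD", "AB?D"))   # None  (both have missing parts)
```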

I wonder whether comparing regular expressions like this is possible, or whether there is a better solution.

Thanks!

Edit 1: Note that str1 and str2 might have come from different strings (e.g. AXCD and ABXD). Our goal is to remove any (possible) duplicates and to be sure that the preserved strings are unique (even if we remove more than necessary). This is why we preserve either str1 or str2. Thanks to Aaron, who pointed this out.

Edit 2: There are millions of strings. That's why an efficient algorithm is needed.

Amin.A
  • Can you explain why you want to do this? What is the problem you are trying to solve? – Tomalak Mar 05 '16 at 13:15
  • You seem to be looking at whether regular expressions "intersect", rather than whether they match: there are strings matched by Reg1 that are not matched by Reg2. – Aaron Mar 05 '16 at 13:26
  • @Tomalak: Please see the edit for more information. – Amin.A Mar 05 '16 at 13:33
  • I understand even less after your edit. Where does the `AZYD` come from? What is hidden behind those `.*`? Ultimately, I don't understand how you plan to determine which patterns are errors and which are not. If the important point is string similarity, maybe the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) would help you better? – Aaron Mar 05 '16 at 13:42
  • Interesting tip @Aaron. A quick look revealed [this question](http://stackoverflow.com/questions/2966865/intersection-of-two-regular-expressions). It seems to be a solution, but a slow one. I am curious to know if someone actually has a better plan for this. If not, I suggest you post your reply as an answer. – Amin.A Mar 05 '16 at 13:43
  • Looking at the question you provided, I don't think the solutions provided are slow at runtime, and the solution based on the automaton library seems quick enough to implement. I'm not confident I've understood your problem so I won't post an answer, feel free to post your implementation as an answer – Aaron Mar 05 '16 at 13:49
  • @Aaron: I tried to explain it further. Hope it helps. On the other hand, note that I have to do this for many collections where each collection contains millions of strings. That's why I am curious to see a faster solution. But if that's the only way, then tough luck. I have to implement a fast one. – Amin.A Mar 05 '16 at 13:57
  • You typed a lot of text, but it's unclear and difficult to understand. Please make it clear by writing something like this: 1) What do you want to do? (Put some sample input and expected output.) 2) What have you tried? Please don't mix it all up; that's just confusing. – fronthem Mar 05 '16 at 14:04
  • @terces907: Thanks for your comment! Hope it's more clear now. – Amin.A Mar 05 '16 at 14:22
  • Thanks for your updates, I understand it much better now. Another question though: `A?CD` and `AB?D` could correspond to two different pieces of data (e.g. `AXCD` and `ABYD`), so why guess that they refer to the same data? Or are you sure to have some minimal number of duplicates? – Aaron Mar 05 '16 at 14:30
  • Is it some kind of glob pattern matching? If yes, you should probably replace `?` by a single dot `.`, not `.*`. – Olivier Grégoire Mar 05 '16 at 14:43
  • @Aaron: Thanks man. Good point! I added another piece to cover these exceptions. – Amin.A Mar 05 '16 at 15:34
  • The rules are still not clear. Why are Reg1 and Reg2 possible candidates but not Reg4 versus all the others? I see that Reg1 and Reg2 matching is more reliable, but how should an algorithm decide? Is a minimum number of known matching characters required or a minimum percentage or something else? – Olivier Jacot-Descombes Mar 05 '16 at 15:35
  • @Olivier No. We don't know how large the missing part is. Therefore a `.*` is required. – Amin.A Mar 05 '16 at 15:35
  • The point of my question is: Why are Reg1 and Reg2 possible duplicates but not say Reg2 and Reg4? Because Reg4 is matching all the others. (I am not talking about the length of the missing parts but the length of the non-missing parts.) – Olivier Jacot-Descombes Mar 05 '16 at 15:38
  • @OlivierJacot: Indeed, `Reg4` is a possible candidate vs. all the others. We want to detect duplicates; it doesn't matter in which comparison a candidate is flagged as a duplicate. – Amin.A Mar 05 '16 at 15:40
  • There are 3 possible result sets in this simple example: { "A?D" }, { "A?CD", "AWXYD" } and { "AB?D", "AWXYD" }, and the number grows rapidly with increasing string collection sizes. Does it matter which set you are keeping at the end? If, by accident, "A?D" happens to be the first string you encounter, then at the end only a single string will be left over. – Olivier Jacot-Descombes Mar 05 '16 at 16:01
  • Interesting point. Thanks @OlivierJacot. I would prefer to keep the ones with most information (i.e. `{"AB?D", "AWXYD"}` or `{"A?CD", "AWXYD"}`). But I understand that this procedure is order dependent. I have to figure out a way to do this. – Amin.A Mar 05 '16 at 16:29
  • If you have millions of strings, brute-force algorithms won't work. If you take the naive approach of comparing each string to every other string (no matter how exactly this comparison is done), you end up with `n*(n-1)/2` comparisons, i.e. **trillions of comparisons**! – Olivier Jacot-Descombes Mar 06 '16 at 16:01

2 Answers


I don't think regular expressions are appropriate for such a task. If you are asking whether there is an existing way to compare regular expressions, the answer is NO; at least I haven't seen one. If you are asking whether there is a way to implement it, I would say YES: you can represent a regex as a finite state machine as well as a graph, and it is possible to check whether two of those are isomorphic, but doing that for regular expressions would be enormously complex. Three things that pop into my mind right now are the Levenshtein distance algorithm, the binary search tree (an extremely search-efficient data structure) and the blackboard architecture. Also, here you will find several answers that can help you. Good luck!

P.S. PostgreSQL has the fuzzystrmatch module with a Levenshtein distance implementation.
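To give an idea of what the Levenshtein distance looks like on the example strings, here is a minimal, unoptimized Python sketch (the function name `levenshtein` is just illustrative; PostgreSQL's fuzzystrmatch module provides an equivalent `levenshtein()` function):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: the minimum number of
    # single-character insertions, deletions and substitutions that turn
    # `a` into `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("ABCD", "AWXYD"))  # 3
print(levenshtein("A?CD", "AB?D"))   # 2
```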

callOfCode

I think the problem is your pattern.

Reg1 --> A.*CD

and

Reg2 --> AB.*D

Sometimes they represent the same pattern, e.g.

ABCD

Either Reg1 or Reg2 can match this text. That means there is some overlap between Reg1 and Reg2.

You might solve your problem by changing your pattern to

Reg1 --> A(?!B).*CD
// (?!B) means the second character can be any letter except `B`

and

Reg2 --> A.*(?<!C)D
// (?<!C) means the second-to-last character can be any letter except `C`

Otherwise you cannot distinguish between these two patterns.
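A quick Python check of this claim, using the pattern names from the answer: the original patterns both match `ABCD`, while the rewritten ones no longer share it (note that `ABCD` then matches neither of them):

```python
import re

originals = {"Reg1": r"A.*CD", "Reg2": r"AB.*D"}
revised   = {"Reg1": r"A(?!B).*CD", "Reg2": r"A.*(?<!C)D"}

# Both original patterns match "ABCD", i.e. they overlap on that text:
print([name for name, p in originals.items() if re.fullmatch(p, "ABCD")])  # ['Reg1', 'Reg2']

# With the negative lookaround, neither rewritten pattern matches "ABCD":
print([name for name, p in revised.items() if re.fullmatch(p, "ABCD")])    # []

# Each rewritten pattern still matches text with a genuinely missing part:
print(bool(re.fullmatch(revised["Reg1"], "AXCD")))  # True
print(bool(re.fullmatch(revised["Reg2"], "ABXD")))  # True
```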

fronthem
  • Thanks @terces907. I am sorry but I can't see the relationship between your answer and my question. I understand where you are going. But that's not the point. Even if I rewrite the regular expressions the way you proposed, I still need to figure out how to compare them. That is the main question. – Amin.A Mar 05 '16 at 15:47
  • @Amin.A As I told you, the problem is your patterns. Patterns are supposed to be unique. – fronthem Mar 05 '16 at 15:51
  • `A.*CD` and `AB.*D` can match the same text, e.g. `ABCD`. That means your regexes are not good enough. You are supposed to find a way to identify what the difference between these two patterns is, as I already suggested. You might do it in a different way, but just make them different. – fronthem Mar 05 '16 at 16:02
  • Note that the `intersect` of two regular expressions is not `null` if there exists a string that both can match (e.g. `ABCD`). @Aaron proposed using this concept (i.e. detecting overlaps between regular expressions) to decide whether two strings are duplicates. We don't want to get rid of these overlaps. On the other hand, this is just a small example; it seems impractical to remove such overlaps for millions of strings. – Amin.A Mar 05 '16 at 16:41
  • Yes @Amin.A, what I am trying to tell you is: get rid of the intersecting part, or separate the intersecting part out into a new pattern. There is no choice. – fronthem Mar 05 '16 at 16:47