6

Possible Duplicate:
How to auto generate regex from given list of strings?

I have two lists of strings ListA and ListB. I need to generate a regular expression that will match all strings in ListA and will not match any string in ListB.

  • The strings could contain any combination of characters, numbers and punctuation.
  • If a string appears on ListA it is guaranteed that it will not be in the ListB.
  • If a string is not in either of these two lists I don't care what the result of the matching should be.

The lists typically contain thousands of strings, and strings are fairly similar to each other.

I know the trivial answer to this question, which is just generate a regular expression of the form (Str1)|(Str2)|(Str3) where StrN is the string from ListA. But I am looking for a more efficient way to do this.

Ideal solution would be some sort of tool that will take two lists and generate a Java regular expression for this.

Update 1: By "efficient", I mean to generate expression that is shorter than trivial solution. The ideal algorithm would generate the shorted possible expression. Here are some examples.

ListA = { C10 , C15, C195 }
ListB = { Bob, Billy }

The ideal expression would be

/^C1.+$/

Another example, note the third element of ListB

ListA = { C10 , C15, C195 }
ListB = { Bob, Billy, C25 }

The ideal expression is

/^C[^2]{1}.+$/

The last example

ListA = { A , D ,E , F , H } ListB = { B , C , G , I }

The ideal expression is the same as trivial solution which is

/^(A|D|E|F|H)$/

Also, I am not looking for the ideal solution, anything better than trivial would help. I was thinking along the lines of generating the list of trivial solutions, and then try to merge the common substrings while watching that we don't wander into ListB territory.

**Update 2*: I am not particularly worried about the time it takes to generate the RegEx, anything under 10 minutes on the modern machine is acceptable

Community
  • 1
  • 1
Vlad
  • 9,180
  • 5
  • 48
  • 67
  • 1
    Why does it need to be a regular expression? If you want a performant matching test, use a [trie](http://en.wikipedia.org/wiki/Trie) – Bergi Oct 14 '12 at 22:16
  • 1
    Are you sure this can be done using only regular expressions ? It seems to me it needs an algorithm. Also this needs more explanation: **" ...a regular expression that will match all strings in ListA and will not match any string in ListB..."** – Tulains Córdova Oct 14 '12 at 22:18
  • 1
    I agree with Bergi -- regex seems like the wrong tool for this. – Ted Hopp Oct 14 '12 at 22:18
  • 2
    What is ListB needed for? If the expression only matches strings from ListA (which are guaranteed not to be in B), why would it need to check additionally for ListB? – Bergi Oct 14 '12 at 22:18
  • Show examples of the lists, so we can see their similarity, and possibly use it to identify a good method for matching – Billy Moon Oct 14 '12 at 22:23
  • @bergi I agree that RegEx is not an ideal solution. But I have no choice, the lists have to be fed into the existing system that only has API to separate the records by RegEx. I didn't write that system, so don't ask me why is that. This is something I have to live with. – Vlad Oct 14 '12 at 22:54
  • @tedHopp Ted, look at my answer to Bergi – Vlad Oct 14 '12 at 22:55
  • @Billy the strings are arbitrary. They might have a common prefix, but the rest is completely random. I am looking into the way to generate regex automatically over arbitrary set of strings. So knowing structure wouldn't help. – Vlad Oct 14 '12 at 22:57
  • 1
    @Vlad, you shouldn't need to worry about things like factoring out common prefixes. The regex library will do that when it compiles the regex into a finite-state machine. – Wyzard Oct 14 '12 at 23:03
  • @Wyzard good to know, still it would be great to how to do it myself. – Vlad Oct 14 '12 at 23:07
  • Last example had `F` in both lists, which contradicts your previous assertion that they never co-exist in both lists. Please clarify. – Billy Moon Oct 15 '12 at 01:41
  • @BillyMoon sorry my bad.. corrected.. – Vlad Oct 15 '12 at 02:02
  • @Vlad you should check the link in the `possible duplicate` comment, it has an implementation in c# that exactly fits with what you are asking. – Billy Moon Oct 15 '12 at 15:04
  • 1
    http://cstheory.stackexchange.com/questions/1854/is-finding-the-minimum-regular-expression-an-np-complete-problem – Andrew Cheong Oct 18 '12 at 18:55
  • @acheong87 wow. That's exactly the answer I was looking for. – Vlad Oct 18 '12 at 20:33

1 Answers1

0

If it's guaranteed that no string will be in both lists, and you don't care about strings that are in neither list, then you just need to match strings in ListA; you can ignore ListB entirely.

The "trivial answer" that you mentioned is an entirely reasonable solution. When you say you want a "more efficient" way, do you mean an efficient way to generate the regular expression itself, or a way to generate a regular expression that matches strings more efficiently?

  • If you want to generate the regex efficiently, most languages' string facilities provide a way to join a list of strings with a separator string (such as a comma) to produce a single string. You can't really get more efficient than that.
  • If you want your expression to match things efficiently, make sure that you "compile" it before you use it. (Most regex libraries have a function for this.) Compiling a regular expression means generating the finite-state machine that the regex library actually uses for matching operations. Any decent regex library should do a decent job of optimizing the FSM, for example mapping common substrings to the same FSM state where possible.

Alternatively, you could forego regular expressions entirely, and just iterate over ListA and compare each of its strings against the candidate string. The individual comparisons may be faster in this case, since looking for an exact string match can compare strings in 4- or 8-byte chunks, whereas a regular expression has to look at each character individually. But if you have lots of strings to compare against, you'll be traversing the candidate string many times in memory. A regular expression, conversely, can traverse the candidate string once to find out whether it matches.

Try both approaches. See which is faster.

Wyzard
  • 33,849
  • 3
  • 67
  • 87
  • "*let the regex compiler do the optimsation*" is a good idea - just pass in the trivial regex. Of course this depends on the environment, we'd need more information about that from @Vlad. – Bergi Oct 14 '12 at 23:10
  • This does not work, compile doesn't simplify, in fact this is in general an NP complete problem if it defined generally enough (kolmogorov) and is surely not trivial to define correctly in order to get what you want, take a look at the MDL (Minimal Descriptive Length) principle for instance. – Veltzer Doron Jan 08 '20 at 20:09