Optimizing a Huge "OR" Pattern

Question

However, the entries that form the pattern itself are few (X, Y, Z, A, B, K, M, and J). So, converting (optimizing) that pattern to a pattern like this: (X ((Y (Z | (A B))) | (A K))) | (M ((A ((J K) | B)) | Z)) , will most probably solve my memory problems.

I need an algorithm to take the input pattern (as string maybe) and produce the optimized one (as string also maybe).

Sorry, we cannot help with that here. Try writing it yourself, and come back if you encounter problems and have questions. — kapa, May 01 '13 at 14:36
in what language even? because this is easy, if i knew what i was answering for it would take me a minute or two only — AngryDuck, May 01 '13 at 14:43
I have tried solving this problem, but failed. Maybe it needed extra effort. @AngryDuck I need an algorithm not a code. This may look trivial, but I failed to implement it. — Hesham Eraqi, May 01 '13 at 14:46
an algorithm not code..... ok... so how do you intend to write / implement this algorithm — AngryDuck, May 01 '13 at 14:47
What's the optimally reduced equivalent clause for a given non-optimal clause? — G. Bach, May 01 '13 at 14:56
you know what nevermind its sounds like you dont know what you want i was going to write some code that would get a string and then return an "optimised" one but theres not enought info here — AngryDuck, May 01 '13 at 15:34
@AngryDuck I'm going to implement the needed algorithm using MATLAB. But it doesn't matter, as I still can code in any other languages and generate an executable to be called from MATLAB. By the way, this problem arouse when I was trying to use the HTK toolkit [link](htk.eng.cam.ac.uk) with an big Grammar data. — Hesham Eraqi, May 01 '13 at 15:49
What do you mean by "entries" (X,Y,...)? Are they placeholders for RegExp pattern (`X=/\d+/`, `Y=/foo(bar)?/`, ...), or they simply match itselves (`(XYZ|XZ)` matches `AXXYZP`)? — viercc, May 01 '13 at 21:26
I have symplified this problem into two sub-problems and managed to solve it. You can check this question out for more details: [link](http://stackoverflow.com/questions/16342579/graph-paths-abstraction-algorithm-needed) — Hesham Eraqi, May 06 '13 at 11:53

score 3 · Accepted Answer · answered May 01 '13 at 14:47

3

Note that (XA)|(XB) = X(A|B). Based on this property, I can suggest the following greedy solution:

Let P be the expression, and X be the most common entry in P: P = (XR1)|(XR2)|...|(XRn)|Q. Then, taking X out of parenthesis, P can be expressed as P = XR | Q, where R = (R1|R2|...|Rn). After this is done, solve the problem recursively for R and Q.

answered May 01 '13 at 14:47

Grigor Gevorgyan

6,753
4
35
64

This sounds a logical answer. Thanks :) – Hesham Eraqi May 01 '13 at 15:50
If these are sequences, then it's also true that `(AX)|(BX) = (A|B)X`, but your approach won't simplify these. (Also true if they're booleans, but simpler to deal with: just put all of them in a consistent order before you start.) – rici May 01 '13 at 17:57
1

@rici It is not a simple work to be able to simplify both pattern. It is true that `AB|AMY|XY = A(B|MY)|XY = AB|(AM|X)Y` and not able to simplify both `A` and `Y`. Isn't it difficult to determine which is better? – viercc May 01 '13 at 21:06
@viercc: there are objective metrics. For example, you could use the operator count of the resulting expression. But that's not the point; there are cases where it's clearly important to simplify on the right, like: `(AXYZ)|(BXYZ)`. Not being able to do so at all is sub-optimal, and there's no obvious reason to expect that the majority of the simplifications are prefix rather than suffix. – rici May 01 '13 at 23:06
@rici I understand. It was clear that getting more ability to simplify is an improvement. – viercc May 02 '13 at 01:09
It would be useful if you check out [this question](http://stackoverflow.com/questions/16342579/graph-paths-abstraction-algorithm-needed). I have symplified this problem into two sub-problems and managed to solve it. – Hesham Eraqi May 21 '13 at 09:13

score 1 · Answer 2 · edited May 23 '17 at 10:25

In general, simplification of boolean expressions is NP-hard (see, for example, Is minimization of boolean expressions NP-Complete?).

If you have at most 8 literals or non-negated variables, there are at most 256 possible min-terms, which is not a huge number, so I suppose you have many more variables than 8. Consider using the Quine-McCluskey Method to simplify your expression instead of some ad hoc method. Or, if the number of variables is large but not huge (eg, less than some small multiple of 64) represent each min-term as a bit-mask and OR terms together as you read them in, rather than evaluating symbolically.

rici · Answer 3 · 2013-05-01T15:53:51.360

From the use of the regex tag, I deduce that the expression which is to be optimised is a regular expression, not a boolean expression.

Regular expressions can be easily reduced to DFAs (state machines) although the number of states is potentially exponential in the size of the regular expression. Once you have the DFA, you can use one of the well-known algorithms for minimizing DFAs; it's not too difficult to do this in O(n log n) where n is the number of states.

A good regex library should be able to do all this for you, but many don't. You might want to take a look at Ragel.

If the DFA for your regular expression is in fact much larger than the regular expression, which is rare but not completely unknown, you might find that the above procedure does blow up memory. For this reason, many regular expression libraries do not perform the full DFA reduction; rather, they reduce only to an NFA and then perform the powerset construction lazily, caching the results. This should avoid memory issues at the cost of increased scan time.

An example of a regular expression which shows exponential blow-up:

A(A|B)(A|B)(A|B)(A|B)(A|B)(A|B)

The corresponding DFA must have at least 32 states (that is, 2⁵ where 5 is the number of repetitions of (A|B).

Thanks. I'll take a look on Ragel, and will be back to you :) — Hesham Eraqi, May 01 '13 at 16:31

Optimizing a Huge "OR" Pattern

3 Answers3