3

As explained in Can regular expressions be used to match nested patterns?, it is not possible to create regex to match arbitrary nested pattern. But is it possible to create an algorithm that would generate a regex of n-th level of "nesteness"?

basically, i want to replace trim(whatever) with rtrim(ltrim(whatever))

i managed to create 3 levels by hand (javascript syntax):

level[1] = /\(([^()]*)\)/g
level[2] = /\(((?:[^()]*\([^()]*\))*[^()]*)\)/g
level[3] = /\(((?:(?:(?:[^()]*\([^()]*\))*[^()]*)*\((?:(?:[^()]*\([^()]*\))*[^()]*)*\))*[^()]*)\)/g

here are some test data:

1st(ddd) + 1st(ddd)
2nd(dd(d))
3rd(a(b) + (cd(h) + d(dfas) + zzz))
4th(a(b(c(d))))
8th(a(b(c(d(e(f(g()))))))

i know that at every level [^()]* needs to be replaced with noncapturing group that can contain parentheses, but i'm not sure how to generalize the algoritm for n-th level...

Community
  • 1
  • 1
Aprillion
  • 21,510
  • 5
  • 55
  • 89

1 Answers1

6

You can think about it more theoretically: a match for parenthesis nested n deep is just parenthesis around matches for n-1 or less deep (with at least one exactly n-1 deep).

We can give a recursive definition of the regexes. Let X[n] be the regex for nesting exactly n levels, and Y[n] be the regex for a string containing brackets with any level of nesting up to n levels, so:

X[n] = \( (Y[n-2] X[n-1])+ Y[n-2] \)

Y[n] = [^()]* ( \( Y[n-1] \) [^()]* )*

with Y[0] = X[0] = [^()]* (no nesting) and X[1] = \([^()]*\). (I'm not bothering with the details of non-capturing groups etc yet, and the spaces are just for readability.)

Writing an algorithm based on this should be quite easy.


The regexes from these new (less mutually recursive) definitions get longer much much more slowly (they are polynomial rather than exponential).

Let l[n] be the length of X[n], and L[n] be the length of Y[n], then (the constant terms are just the hardcoded characters in each one):

L[n] = 19 + L[n-1] = 19*n + L[0] = 19*n + 6

l[n] = 3 + L[n-2] + l[n-1] + 2 + L[n-2] + 2
     = 7 + 2 * L[n-2] + l[n-1]
     = -57 + 38 * n + l[n-1]

with the appropriate initial conditions for l[0] and l[1]. Recurrence relations of this form have quadratic solutions, so this is only O(n^2). Much better.

(For others, I had a previous definition of Y[n] was Y[n] = Y[n-1] | X[n]; this extra recursion meant that the length of the X regex was O(2.41^n), which sucks a lot.)

(The new definition of Y is a tantalising hint that there might even be a way of writing X that is linear in n. I don't know though, and I have a feeling the extra restriction on X of exact length means it is impossible.)


The following is some Python code that computes the regexes above, you can probably translate it to javascript without too much trouble.

# abbreviation for the No Parenthesis regex
np = "[^()]*"

# compute Y[n] from Y[n-1]
def next_y(y_n1):
    return np + "(?:\(" + y_n1 + "\)" + np + ")*"

# compute X[n] from X[n-1] and Y[n-2]
def next_x(x_n1, y_n2):
    return "\((?:" + y_n2 + x_n1 + ")+" + y_n2 + "\)"

# compute [X[n], Y[n], Y[n-1]]
# (to allow us to make just one recursive call at each step)
def XY(n):
    if n == 0:
        return [np, # X[0]
                np, # Y[0]
                ""] # unused
    elif n == 1:
        return ["\([^()]*\)", # X[1]
                next_y(np),   # Y[1]
                np]           # Y[0]

    x_n1, y_n1, y_n2 = XY(n-1) # X[n-1], Y[n-1], Y[n-2]

    return [next_x(x_n1, y_n2), # X[n]
            next_y(y_n1),       # Y[n]
            y_n1]               # Y[n-1]

# wrapper around XY to compute just X[n]
def X(n):
    return XY(n)[0]

# wrapper around XY to compute just Y[n]
def Y(n):
    return XY(n)[1]
huon
  • 94,605
  • 21
  • 231
  • 225
  • thanks, O() isn't a problem, i hope noone will use more than 10 nested parentheses, i'll have a look at it during weekend :) – Aprillion May 04 '12 at 12:32
  • 1
    @deathApril, I just vastly improved the answer by changing the definition of `Y`. This reduces the size of the regex for `n=10` from ~45000 to ~1500 (so even for `10` the O() was making the regex huge!). (I understand you aren't worried about large `n`, but I find this question very interesting (thanks for asking it!), so I'm doing the analysis anyway.) – huon May 04 '12 at 14:44
  • @deathApril, and now there is some code (although Python rather than Javascript). – huon May 04 '12 at 15:08
  • glad to hear it is so interesting for you - less work for me ;)) python is ok,, i'll just run it and use the resulting regex in a notepad++ macro (or i create a simple web app to do it, if other team members would want to use it) - i will accept after i test it a bit more ;) – Aprillion May 04 '12 at 16:51