
I am trying to put together a regex in JavaScript that captures everything between two delimiters (hyphens -) like so (captured text in bold):


    -Hello everybody!- I'm here!

However, I also utilize braces to denote special information like so: This is {special} stuff here. Any hyphen delimiters found inside braces should be ignored:


    -This {stuff} matches- here.
    -No match found {here-}.
    -But this last {hyphen-} works-

Further, the hyphens should only match the outermost, even-numbered hyphen. That is, there may be additional hyphens inside but only in even-numbered pairs:


    -some -inner hyphens- inside- here.
    -some --inner hyphens- but the odd one outside -.
    -inner hyphens -inside- {braces-} -are- still ignored in this count- as usual-
    -Hyphens both before- and after {the-braces} must be counted-

Further, I need to balance all of this while still allowing half-braces to work. That is, if there is no closing brace, the hyphens are legal, and same if there is no opening brace:


    -This is {fine-
    -And so is this- here}

I can get the even-numbered hyphens to match with something like ^-((?:[^-]|(?:[^-]*?-){2})*)- (each of my cases starts with a - at the beginning of the line), but adding the braces to this is going over my head. I asked an earlier related question here about ignoring text inside braces, but the context is different enough that I can't seem to wrestle it in there.

Trevor Buckner

1 Answer

I modified your regex to ignore all hyphens inside braces:

/^-((?:(?:(?!{.*?})[^-\n]|{.*?})|(?:(?:(?!{.*?})[^-\n]|{.*?})*?-){2})*)-/

Demo can be found here. The demo includes \n in the [^-] parts, because it takes the whole file as input instead of going line-by-line.

Compared to your original regex, I've replaced both [^-] parts with this:

(?:(?!{.*?})[^-]|{.*?})

This piece of logic ensures that hyphens between braces are not counted. It has to be included twice to ensure that hyphens between braces are also skipped in 'the even-counter'.
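
As a rough sketch of that substitution, the pattern can be assembled from the repeated piece so the duplication is visible. The names `UNIT`, `PATTERN`, and `capture` are made up for this illustration, and the braces are escaped here, which does not change what they match. The sample lines are taken from the question.

```javascript
// Hypothetical name: the piece that replaces each [^-] of the original regex.
// Either one ordinary character that does not start a closed {...} group,
// or a whole {...} group consumed in one step.
const UNIT = String.raw`(?:(?!\{.*?\})[^-\n]|\{.*?\})`;

// Same shape as the regex in the question, with UNIT used in both places.
const PATTERN = new RegExp(`^-((?:${UNIT}|(?:${UNIT}*?-){2})*)-`);

// Returns the text between the outer hyphens, or null when nothing matches.
function capture(line) {
  const match = PATTERN.exec(line);
  return match ? match[1] : null;
}

const samples = [
  "-Hello everybody!- I'm here!",
  "-This {stuff} matches- here.",
  "-No match found {here-}.",
  "-some -inner hyphens- inside- here.",
  "-This is {fine-",
];

for (const line of samples) {
  console.log(JSON.stringify(line), "->", JSON.stringify(capture(line)));
}
```

Building the pattern from a string keeps the two copies of the piece in sync; writing the regex as a single literal, as above in the answer, behaves the same way.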

I've used a lookahead conditional to check if there's a matching closing brace, rewritten as described here to support JavaScript. If a closing brace is found, we match the entire part between the braces unconditionally. If no closing brace can be found, we just treat it as a normal character.

It does not yet support nested braces correctly, but that can be added using a construct similar to the one you used for hyphens.

Why is the negative lookahead necessary?

The goal is to skip all hyphens between braces. To explain how it works, let's consider the following, much simpler regex: ^-(?:[^-]|{.*?})*-. This regex tries to find the next hyphen that is not between braces.

The [^-] part consumes any character that is not a hyphen. The regex executor will walk over the string, character by character, until it encounters a hyphen.

-No match found {here|-}.

It will not use the other option, because the first one suffices. In this position, the next character is a hyphen. This will match the last hyphen in the regex, finishing the matching procedure.

-No match found {here-|}.

Unfortunately, this hyphen is between braces and should have been ignored.
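
This behaviour can be reproduced with a short check. It is only a sketch: the variable names are invented for the illustration, and the braces are escaped, which does not change what they match.

```javascript
// The simplified regex from the explanation: [^-] is tried before {.*?}.
const naivePattern = /^-(?:[^-]|\{.*?\})*-/;

const naiveResult = naivePattern.exec("-No match found {here-}.");

// [^-] happily consumes the "{" as an ordinary character, so the match
// ends at the hyphen inside the braces.
console.log(naiveResult && naiveResult[0]); // -No match found {here-
```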


One could try to change the order of the options like so: ^-(?:{.*?}|[^-])*-. If we try another example, we'll see that it works correctly:

-But this last {hyphen-} works-|

However, when we use the original example, something goes wrong. The difference occurs in this position:

-No match found |{here-}.

Here, the regex executor first tries to jump over the content between braces, like so:

-No match found {here-}|.

It will then fail to find a hyphen in the rest of the string. But the regex executor isn't stupid. There are two options, so it will just try the second option at the opening brace. This allows the executor to enter the content between the braces, and it will find the hyphen in there:

-No match found {here-|}.
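
A small sketch of the reordered variant shows both outcomes side by side. The names `bracesFirst` and `firstMatch` are hypothetical, and the braces are escaped without changing what they match.

```javascript
// Reordered alternation: try to skip a {...} group before falling back to [^-].
const bracesFirst = /^-(?:\{.*?\}|[^-])*-/;

// Returns the full matched text, or null when there is no match.
function firstMatch(line) {
  const match = bracesFirst.exec(line);
  return match ? match[0] : null;
}

// Works: the braces are skipped and the final hyphen closes the match.
console.log(firstMatch("-But this last {hyphen-} works-"));

// Still wrong: after skipping the braces fails to lead to a match, the engine
// backtracks, takes "{" via [^-], and stops at the hyphen inside the braces.
console.log(firstMatch("-No match found {here-}."));
```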

When we add in the negative lookahead, the regex looks like this: ^-(?:(?!{.*?})[^-]|{.*?})*-. Again, the difference is at this position:

-No match found |{here-}.

Here, the pattern inside the negative lookahead matches, so the lookahead fails and the first option is rejected. This forces the regex executor to use the other option. Doing so results in the following situation:

-No match found {here-}|.

The regex executor jumped over the entire part between braces, and the negative lookahead ensures that backtracking won't change this. Because the second hyphen is in between the braces and the regex executor won't enter the braces, it can't find the second hyphen, and will mark this string as 'no match'.
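
The effect of the lookahead can be checked with a sketch like the following. The names `guarded` and `guardedMatch` are invented for the illustration, and the braces are escaped without changing what they match. The third sample is the half-brace case from the question, where no closing brace exists and the `{` is therefore treated as a normal character.

```javascript
// Simplified regex with the negative lookahead guarding the [^-] option.
const guarded = /^-(?:(?!\{.*?\})[^-]|\{.*?\})*-/;

// Returns the full matched text, or null when there is no match.
function guardedMatch(line) {
  const match = guarded.exec(line);
  return match ? match[0] : null;
}

// The only other hyphen is inside closed braces, so nothing matches.
console.log(guardedMatch("-No match found {here-}.")); // null

// The braces are skipped as a unit and the final hyphen closes the match.
console.log(guardedMatch("-But this last {hyphen-} works-"));

// No closing brace: the lookahead does not block "{", so the hyphen counts.
console.log(guardedMatch("-This is {fine-"));
```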

Jager567
  • This is really good. The one problem is if there are hyphens before the first braces then they don't seem to get counted in the even-counter. For instance something like `-inner -hyphens {before-} the braces should still count toward even odd-` is matching on the last hyphen instead of the second one. Any thoughts? – Trevor Buckner May 26 '20 at 19:46
  • I added a new test case to the question `-Hyphens both before- and after {the-braces} must be counted-` that shows this. – Trevor Buckner May 26 '20 at 19:53
  • @TrevorBuckner I've updated the answer to fix the issue. – Jager567 May 26 '20 at 21:31
  • This seems to work perfectly now. Do you have any intuition why the conditional needs the negative lookahead in addition to the positive one? I would assume you just need the first one, but of course if I remove it it doesn't work anymore. – Trevor Buckner May 27 '20 at 14:30
  • @TrevorBuckner It's actually the other way around. You actually only need the negative lookahead, the positive one can be safely removed. I tried to explain it, but it got a little out of hand... Please see the edited answer – Jager567 May 27 '20 at 16:57
  • Thanks. This helps a lot! Removing the positive lookahead (or maybe reordering the alternation?) also cut down the number of steps by 20% so that's even better. – Trevor Buckner May 28 '20 at 00:09