0

I am having a tough time figuring out a clean regex (in JavaScript) that captures as much of a line as it can while following a pattern, except that anything inside braces doesn't need to follow the pattern. I'm not sure of the best way to explain that except by example:

Let's say the pattern is: the line must start with 0, then contain only a sequence of 1, 2, or 3, and end at a 0 that may appear anywhere in the line. So I use ^(0[123]+0). This should match the first part of each of these strings:


    0213123123130
    012312312312303123123
    01231230123123031230
    etc.

But I want to be able to insert {gibberish} between braces into the line and have the Regex allow it to disrupt the pattern. i.e., ignore the pattern of the curly braces and everything inside, but still capture the full string including the {gibberish}. So this would capture everything in bold:


    01232231{whatever 3 gArBaGe? I want.}121{foo}2310312{bar}3120123

and a 0 inside the braces does not end the capture prematurely, even if the pattern is correct.


    01213123123123{21310030123012301}31231230123

EDIT: Now, I know I could just do something like ^0[123]*?(?:{.*})*?[123]*?0 maybe? But that only works if there is a single set of braces, and now I have to duplicate my [123] pattern. As that [123] pattern gets more complex, having it appear more than once in the Regex starts getting really incomprehensible. Something like the best regex trick seemed promising but I couldn't figure out how to apply it here. Using crazy lookarounds seems like the only way now but I would hope there's a cleaner way.

Trevor Buckner
  • I am ignoring the pattern inside the braces, yes, but I still want to include the contents of the braces in my capture. `0123{5}670` is not a match because it contains `67` outside of the braces, which is not included in `[123]`. I include examples with `0` inside the braces to show that the match should not end prematurely inside braces even if the pattern continues correctly. – Trevor Buckner May 26 '20 at 14:18
  • @CarySwoveland I have made a couple changes to the text to clarify. – Trevor Buckner May 26 '20 at 17:50
  • @CarySwoveland In the edit I did change the question to state that 'the line must start with 0'. The answer I selected was what I needed, minus the `^`. – Trevor Buckner May 26 '20 at 18:57

4 Answers

4

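Since you've specified that you want the whole match including the garbage, you can use ^0([123]+(?:{[^}]*}[123]*)*)0 and use $1 to get the part between the 0s, or $0 to get everything that matched (in JavaScript, that is match[1] and match[0] on the match array, or $1 and $& in a replacement string).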

https://regex101.com/r/iFSabs/3
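
As a minimal sketch of reading both values in JavaScript: the input string below is a made-up example, and the braces are escaped only to keep the pattern valid under the `u` flag as well. It shows the whole match, the captured part between the 0s, and the `$&` and `$1` replacement tokens (`$0` is a regex101 convention rather than a JavaScript replacement token).

```javascript
// Same pattern as in the answer, with the braces escaped.
const pattern = /^0([123]+(?:\{[^}]*\}[123]*)*)0/;

// Hypothetical input: the 0 inside the braces must not end the match.
const input = '0121{a 0 b}31230123';

const match = pattern.exec(input);
const whole = match[0]; // everything that matched, including both 0s
const inner = match[1]; // only the part between the 0s

// In a replacement string, $& is the whole match and $1 is the first group.
const wrapped = input.replace(pattern, '[$&]');
const innerOnly = input.replace(pattern, '<$1>');

console.log(whole);     // 0121{a 0 b}31230
console.log(inner);     // 121{a 0 b}3123
console.log(wrapped);   // [0121{a 0 b}31230]123
console.log(innerOnly); // <121{a 0 b}3123>123
```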

Here's the rundown on how the regex works:

  • ^ anchors the match to start at the beginning of the line
  • 0 matches a literal zero character
  • ([123]+(?:{[^}]*}[123]*)*) is a capturing group that captures everything inside of it.
    • [123]+ matches one or more instances of 1, 2, or 3
    • (?:{[^}]*}[123]*)* is a non-capturing group. I.e. it'll be part of the match, but won't have a $# for use in replacement or the match.
      • {[^}]*} matches a literal { followed by any number of non } characters followed by }
      • [123]* matches zero or more instances of 1, 2, or 3
      • Then this whole non-capturing group can be matched 0 or more times.

The process behind this regex is known as unrolling the loop. http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop gives a good description of it. (with a few typo fixes)

The unrolling the loop technique is based on the hypothesis that in most case, you [know] in a [repeated] alternation, which case should be the most usual and which one is exceptional. We will called the first one, the normal case and the second one, the special case. The general syntax of the unrolling the loop technique could then be written as:

normal* ( special normal* )*

Which could means something like, match the normal case, if you find a special case, matched it than match the normal case again. [You'll] notice that part of this syntax could [potentially] lead to a super-linear match.
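
One way to keep the unrolled form readable as the "normal" pattern grows is to write each piece once and assemble the regex from strings. The sketch below is only an illustration of that idea: `buildUnrolled` is a hypothetical helper, not part of any library, and it hard-codes the leading and trailing 0 from the question.

```javascript
// Hypothetical helper: assembles  ^0 ( normal+ (?: special normal* )* ) 0
// from two sub-pattern sources, so each sub-pattern is written only once.
function buildUnrolled(normal, special) {
  // Wrap each piece in a non-capturing group so quantifiers apply to the whole piece.
  const n = `(?:${normal})`;
  const s = `(?:${special})`;
  return new RegExp(`^0(${n}+(?:${s}${n}*)*)0`);
}

// normal case: one of 1, 2, 3; special case: a {...} section without a closing brace inside
const unrolled = buildUnrolled('[123]', '\\{[^}]*\\}');

const good = unrolled.exec('0121{a 0 b}31230123');
const bad = unrolled.exec('0123{5}670'); // 67 is outside the braces

console.log(good && good[0]); // 0121{a 0 b}31230
console.log(good && good[1]); // 121{a 0 b}3123
console.log(bad);             // null
```

Swapping in a more complex normal pattern then only means changing the first argument, at the cost of the regex no longer being visible as a single literal.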

Example using RegExp#test and RegExp#exec:

const strings = [
  '0213123123130',
  '012312312312303123123',
  '01231230123123031230',
  '01213123123123{21310030123012301}31231230123',
  '01212121{hello 0}121312',
  '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123',
  '012321212211231{whatever 3 gArBaGe? I want.}121231{extra garbage}3123120123',
];
const regex = /^0([123]+(?:{[^}]*}[123]*)*)0/;

// test() only reports whether each string matches
console.log('tests');
console.log(strings.map((string) => `'${string}': ${regex.test(string)}`));

// exec() returns the match array; index 1 is the capture between the 0s
console.log('matches');
const matches = strings
  .map((string) => regex.exec(string))
  .map((match) => (match ? match[1] : undefined));
console.log(matches);

Robo Robok's answer is what I'd go with if you only want to keep the non-braced part, although I'd use a slightly different regex ({[^}]*}) for a bit more performance.

Zachary Haber
  • I borrowed your regex101 code and changed it just a bit. It looks like this might work? `^0([123]+(?:{[^}]*})*?[123]*)*0` But I'm concerned that it takes [half a million steps](https://regex101.com/r/SJMtt9/1) to make 7 matches? – Trevor Buckner May 26 '20 at 02:00
  • The regex I've written matches everything without having to be updated. And only takes 108 steps. `0[123]+(?:{[^}]*}[123]*)*0`. If you need to match only the things between the 0s, `0([123]+(?:{[^}]*}[123]*)*)0` will work for that. – Zachary Haber May 26 '20 at 02:04
  • If you need to match everything, `^(0(?:[123]|{.+?})+0)` will work without needing to duplicate the `[123]` – Mogzol May 26 '20 at 02:11
  • This is very true. The duplication in mine is intentional for the sake of optimization. For a small amount of matches, this won't be an issue, but if there's a large amount of matches (or large length of garbage sections), using alternation and lazy operators might not be the best. – Zachary Haber May 26 '20 at 02:16
  • The last comments here by ZacharyHaber and @Mogzol are the closest to what I am trying to do by capturing everything in a single capture. Thanks for the help. – Trevor Buckner May 26 '20 at 15:24
  • @CarySwoveland, Sorry for the confusion. I must have missed the caret, but thank you for putting the carrot before the stick and letting me know my mistake. I took the opportunity while editing to add more information on how the regex works and the premise behind its structure. – Zachary Haber May 26 '20 at 20:22
1

How about the other way around? Checking the string with curly tags removed:

const string = '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123{foo}123';
const stringWithoutTags = string.replace(/\{.*?\}/g, '');

const result = /^(0[123]+0)/.test(stringWithoutTags);
Robo Robok
  • I see where this is going, but I also need to capture that matching segment, including all of the `{gibberish}` in the original string, not just test that it is true or false. Is there a way to do that with this method? – Trevor Buckner May 26 '20 at 01:54
1

You say you need to capture everything, including the gibberish, so I think a simple pattern like this should work:

^(0(?:[123]|{.+?})+0)

That allows a string starting with 0, and then any of your pattern characters (1, 2, or 3), or one of the { gibberish } sections, and allows that to repeat to handle multiple gibberish sections, and finally it must end with a 0.

https://regex101.com/r/K4teGY/2
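
One thing that may be worth checking before relying on the alternation form: because `.` also matches `}`, the lazy `.+?` can be stretched by backtracking across several brace sections. The sketch below compares it with the negated-class form from the other answers. Both test inputs are made up for illustration, and the braces are escaped only so the patterns stay valid under the `u` flag.

```javascript
// Alternation with a lazy dot (this answer) vs. the unrolled negated-class form.
const lazyDot = /^(0(?:[123]|\{.+?\})+0)/;
const negatedClass = /^0[123]+(?:\{[^}]*\}[123]*)*0/;

// Hypothetical inputs: in `stray`, the 4 sits outside the braces.
const clean = '01{a}2{b}10';
const stray = '01{a}4{b}10';

const lazyClean = lazyDot.exec(clean);
const negatedClean = negatedClass.exec(clean);
const lazyStray = lazyDot.exec(stray);
const negatedStray = negatedClass.exec(stray);

console.log(lazyClean[0]);    // 01{a}2{b}10
console.log(negatedClean[0]); // 01{a}2{b}10
// Backtracking lets {.+?} grow to cover "{a}4{b}", so the stray 4 is accepted:
console.log(lazyStray && lazyStray[0]); // 01{a}4{b}10
console.log(negatedStray);              // null
```

If that matters for your input, replacing `.+?` with `[^}]+` in the alternation keeps the single `[123]` while stopping each brace section at its first closing brace.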

Mogzol
1

You might use

^0[123]*(?:{[^{}]*}[123]*)*0
  • ^ Start of string
  • 0 Match a zero
  • [123]* Match 0+ times either 1, 2 or 3
  • (?: Non capture group
    • {[^{}]*}[123]* Match from an opening { to the closing }, followed by 0+ times either 1, 2 or 3
  • )* Close group and repeat 0+ times
  • 0 Match a zero

Regex demo

The fourth bird