I have a regex that looks like this:

/^(.*?)( tom.*)?$/

I execute it on the string

call tomorrow

The capture groups for this match are:

1. `call`
2. ` tomorrow`

However, notice that because the second matching group is optional, the first wildcard could consume the whole string and the match would still be valid. This is exactly what happens if you make the first wildcard greedy by removing the question mark.

1. `call tomorrow`

So my question is: is there any way to instruct the regex engine that I want all valid matches to the string, not just the first one (based upon laziness/greediness)? I acknowledge that this may be slow, but it's necessary for my case.

To clarify, I want to parse the string call tomorrow and have it return:

MATCH 1
1. `call`
2. ` tomorrow`
MATCH 2
1. `call tomorrow`

When the regex engine encounters the (.*?), it first consumes 0 characters and then tries to match the rest of the pattern. When that fails, it tries 1 character, then 2, then 3, then 4. At 4 characters (call), the rest of the regex matches through to the end of the string, and the engine stops. I want a way to say "parse again, but start with that wildcard consuming 5 characters, then 6, then 7..." Eventually, it will try consuming 13 characters (call tomorrow), which also allows the rest of the regex to match, and it should return that result as well.

Please note that this is not a question about the g flag: the index of the match is not changing.

If this is not possible, is Regex the wrong tool for this application? What should I be using instead?

Brandon Horst
  • 1
    It's hard to tell what result you're looking for. An example (say, a list of the matches you want) would really help. – T.J. Crowder Jun 01 '15 at 15:03
  • Added a clarification – Brandon Horst Jun 01 '15 at 15:07
  • 1
    If I understand correctly, you need some kind of regex tokenizer. A tokenizer uses another (!) regular expression to break a regular expression into parts that can be matched separately (it deconstructs the regex based on groups, ORs, etc.). As I recall, Robin Herbots did something like that as a jQuery extension for his input mask (https://github.com/RobinHerbots/jquery.inputmask/blob/3.x/js/inputmask.regex.extensions.js). You could try to extract "analyseRegex" and "validateRegexToken" from his implementation (MIT License) and tailor them to your needs. – SzybkiSasza Jun 01 '15 at 15:07
  • See http://stackoverflow.com/q/11228384/1225328 as well. – sp00m Jun 02 '15 at 06:55
  • sp00m's link points me to the right place. The answer is "it's not possible without writing code myself". I figured as much, but it's good to have confirmation. If someone posts that as an answer I'll accept it. – Brandon Horst Jun 02 '15 at 13:12

2 Answers

I think you can do this by adding an outer capturing group that wraps the whole pattern, like this:

^((.*?)( tom.*)?)$

I know it isn't the exact output you want, but you can have this match content:

MATCH 1
1.  [0-13]  `call tomorrow`
2.  [0-4]   `call`
3.  [4-13]  ` tomorrow`


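As a side note, I noticed that your second group includes the space before tomorrow. You may prefer to move the space outside the group, as in the regex below (note that the space is then required for a match):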

^((.*?) (tom.*)?)$
Federico Piazza

In this simple example, add another capture group, though you'll need to deal with duplicates.

> re = /^((.*?)( tom.*)?)$/
> console.log('call tomorrow'.match(re))
["call tomorrow", "call tomorrow", "call", " tomorrow", index: 0, input: "call tomorrow"]
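One possible way to deal with the duplicates, as a minimal sketch: drop the full-match entry at index 0, discard groups that did not participate, and deduplicate the rest. The helper name `candidateGroups` is hypothetical and only for illustration.

```javascript
// Same wrapped pattern as above: group 1 is the whole string,
// group 2 is the lazy prefix, group 3 is the optional " tom..." suffix.
const re = /^((.*?)( tom.*)?)$/;

// Hypothetical helper: return the distinct captured strings for one input.
function candidateGroups(input) {
  const m = input.match(re);
  if (m === null) return [];
  // m[0] is the full match and always duplicates group 1, so skip it.
  // Groups that did not participate in the match are undefined.
  const groups = m.slice(1).filter((g) => g !== undefined);
  // A Set keeps the first occurrence of each string, in order.
  return Array.from(new Set(groups));
}

console.log(candidateGroups('call tomorrow'));
console.log(candidateGroups('call'));
```

This only collapses repeated strings; it still reports groups from a single match rather than enumerating every possible match.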

For more complicated cases, you need to write a loop yourself. These answers have some good ideas.
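As a rough illustration of what such a loop might look like for the pattern in the question, the sketch below tries every possible length for the first group and checks whether the remainder satisfies the rest of the pattern. It assumes the regex can be split by hand into a "head" part and a "tail" part; the helper name `allSplits` and that manual split are assumptions for this example, not a general-purpose technique.

```javascript
// Hypothetical helper: enumerate every way head + tail can cover the input.
// headRe and tailRe must each be anchored with ^ and $ and must not use
// the g or y flags, so that test() carries no state between calls.
function allSplits(input, headRe, tailRe) {
  const results = [];
  // Let the first group consume 0, 1, 2, ... input.length characters.
  for (let i = 0; i <= input.length; i++) {
    const head = input.slice(0, i);
    const tail = input.slice(i);
    if (headRe.test(head) && tailRe.test(tail)) {
      // Mirror String.prototype.match: an optional group that matched
      // nothing is reported as undefined.
      results.push([head, tail === '' ? undefined : tail]);
    }
  }
  return results;
}

// /^(.*?)( tom.*)?$/ split by hand into its two parts.
const matches = allSplits('call tomorrow', /^.*$/, /^( tom.*)?$/);
console.log(matches);
```

For `call tomorrow` this yields two results: `call` with ` tomorrow`, and `call tomorrow` with no second group. The loop runs the two regexes once per split point, so it is slower than a single match, which the question already accepts.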

Kristján