I have written a regular expression to match some tags that look like this:

@("hello, world" bold italic font-size="15")

I want the regular expression to match these strings: ['hello, world', 'bold', 'italic', 'font-size="15"'].

However, only these strings are matched: ['hello, world', 'font-size="15"'].

Other examples:

  1. (success)@("test") -> ["test"]
  2. (success)@("test" bold) -> ["test", "bold"]
  3. (fail)@("test" bold size="15") -> ["test", "bold", 'size="15"']

I have tried using this regular expression:

\@\(\s*"((?:[^"\\]|\\.)*)"(?:\s+([A-Za-z0-9-_]+(?:\="(?:[^"\\]|\\.)*")?))*\s*\)

A broken down version:

\@\(
  \s*
  "((?:[^"\\]|\\.)*)"
  (?:
    \s+
    (
      [A-Za-z0-9-_]+
      (?:
        \=
        "(?:[^"\\]|\\.)*"
      )?
    )
  )*
  \s*
\)

The regular expression is trying to

  1. match beginning of the sequence (@(),
  2. match a string with escaped characters,
  3. match some (>= 1) blanks,
  4. match a name ([A-Za-z0-9-_]+),
  5. (optional, grouped with (6)) match a = sign,
  6. (optional, grouped with (5)) match a string with escaped characters,
  7. repeat (3) - (6),
  8. match end of the sequence ()).

However, this regular expression only matches "hello, world" and font-size="15". How can I make it also match bold and italic, i.e. to match the group ([A-Za-z0-9-_]+(?:\="(?:[^"\\]|\\.)*")?) multiple times?
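
A minimal reproduction of the behaviour, as a sketch: it joins the broken-down pattern above into one line (with the hyphen moved to the end of the character class, which should not change what it matches) and prints the captured groups. The variable names `tagRe`, `m` and `captured` are only illustrative.

```javascript
// The broken-down pattern from above, joined into one line.
var tagRe = /@\(\s*"((?:[^"\\]|\\.)*)"(?:\s+([A-Za-z0-9_-]+(?:="(?:[^"\\]|\\.)*")?))*\s*\)/;

var m = tagRe.exec('@("hello, world" bold italic font-size="15")');

// Group 2 sits inside a repeated group, so every iteration overwrites it:
// first "bold", then "italic", and finally font-size="15".
var captured = m.slice(1);
console.log(captured); // [ 'hello, world', 'font-size="15"' ]
```

The whole tag is matched, but a JavaScript match result has exactly one slot per capturing group, so only the value from the last repetition is kept.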

Expected result: ['hello, world', 'bold', 'italic', 'font-size="15"']

P.S. I am using JavaScript's native regular expressions.

  • Is the string a standalone one or are you trying to match it inside a larger text? – Wiktor Stribiżew May 09 '16 at 11:45
  • Inside a larger text, actually a markdown. The group is matched using the `String.match` function and then each case is handled with another function. – user3186610 May 09 '16 at 11:47
  • Could you provide some more examples of things you want to match/not match? The regex you've written looks incredibly complicated for what should hopefully be a simple task. – Tom Lord May 09 '16 at 11:48
  • Possible duplicate of [Javascript regex multiple captures again](http://stackoverflow.com/questions/14707360/javascript-regex-multiple-captures-again) – Kuba Wyrostek May 09 '16 at 11:49
  • @KubaWyrostek no, I tried and it didn't work. – user3186610 May 09 '16 at 11:52
  • @TomLord let me edit it... – user3186610 May 09 '16 at 11:52
  • @TomLord It is not an easy task to parse code/markdown, especially with a regex. Actually, I do not think it is really a good idea to do it with a regex. Ok, if a regex should be used, then there should be 2 steps: 1) extracting with [`@\((?:\s*(?:"[^"\\]*(?:\\.[^"\\]*)*"|[\w-]+(?:="?[^"\\]*(?:\\.[^"\\]*)*"?)?))+\s*\)`](https://regex101.com/r/yH4wA0/1), 2) tokenizing the match with [`(?:"([^"\\]*(?:\\.[^"\\]*)*)"|[\w-]+(?:="?[^"\\]*(?:\\.[^"\\]*)*"?)?)`](https://regex101.com/r/yH4wA0/2). – Wiktor Stribiżew May 09 '16 at 11:56
  • @WiktorStribiżew your solution is right, and actually this is (originally) a simple regex, but an escaped-string match is embedded inside. Anyway, this is not a duplicate, given the correct solution you provided... also, can you post it as an answer here? – user3186610 May 09 '16 at 12:01

1 Answer

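You need a 2-step solution, because a repeated capturing group in a JavaScript regex keeps only the value from its last iteration:

  1. extract each whole @(...) tag with one regex,
  2. tokenize each extracted tag with a second regex.

Example code: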

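// Step 1: matches one complete @(...) tag anywhere in the text.
var re = /@\((?:\s*(?:"[^"\\]*(?:\\.[^"\\]*)*"|[\w-]+(?:="?[^"\\]*(?:\\.[^"\\]*)*"?)?))+\s*\)/g;
// Step 2: matches one token inside a tag; group 1 holds a quoted string without its quotes.
var re2 = /(?:"([^"\\]*(?:\\.[^"\\]*)*)"|[\w-]+(?:="?[^"\\]*(?:\\.[^"\\]*)*"?)?)/g;
var str = 'Text here @("hello, world" bold italic font-size="15") and here\nText there @("Welcome home" italic font-size="2345") and there';
var res = [];
var m, n, tmp; // declared explicitly so they do not leak as implicit globals

while ((m = re.exec(str)) !== null) {
    tmp = [];
    // Tokenize the tag that was just extracted.
    while ((n = re2.exec(m[0])) !== null) {
      if (n[1]) {
        tmp.push(n[1]); // quoted string: keep the content only
      } else {
        tmp.push(n[0]); // bare word or name="value" pair: keep as is
      }
    }
    res.push(tmp);
}
document.body.innerHTML = "<pre>" + JSON.stringify(res, null, 4) + "</pre>";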
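
Both patterns above write the quoted-string part in an "unrolled" form, `"[^"\\]*(?:\\.[^"\\]*)*"`, instead of the alternation `"(?:[^"\\]|\\.)*"` used in the question. The following is a small sanity check, not a proof, that the two forms accept and reject the same inputs; the sample strings and the names `alternation`, `unrolled`, `samples` and `agree` are made up for illustration.

```javascript
// Quoted-string subpattern as written in the question (alternation form).
var alternation = /^"(?:[^"\\]|\\.)*"$/;
// Unrolled form, as used in the two patterns of the answer.
var unrolled = /^"[^"\\]*(?:\\.[^"\\]*)*"$/;

// Hypothetical sample inputs, chosen for illustration only.
var samples = [
  '"hello, world"',          // plain string
  '"say \\"hi\\""',          // escaped quotes inside
  '"ends with slash \\\\"',  // escaped backslash right before the closing quote
  '"unterminated',           // no closing quote at all
  '"dangling \\"'            // the last quote is escaped, so the string never ends
];

// True only if both patterns give the same verdict on every sample.
var agree = samples.every(function (s) {
  return alternation.test(s) === unrolled.test(s);
});
console.log(agree); // true
```

The unrolled form consumes runs of ordinary characters in one step rather than trying the alternation once per character, which is where the performance benefit comes from.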
Wiktor Stribiżew
  • Thanks so much, and thanks for the tips about using `RegExp.exec` - however, I'm using `String.replace` with a callback and it works well! Thanks for your effort in debugging this long RegExp! – user3186610 May 09 '16 at 12:15
  • Just FYI: I unrolled the `"(?:[^"\\]|\\.)*"` with `"[^"\\]*(?:\\.[^"\\]*)*"` for better performance. I will perhaps add more details once I have more time. – Wiktor Stribiżew May 09 '16 at 12:19
  • Thanks for that and this is just for my own small project - I'm trying to make a fill-in-the-blank generator (markdown -> pdf), so performance won't be a very important factor - although yes, regex performance can be very bad sometimes :) Thanks – user3186610 May 09 '16 at 13:21
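
For readers who want the `String.replace` variant mentioned in the comments, here is one way it might look. This is only a sketch: the asker's actual code is not shown, so `expandTags`, `renderTag`, `tagRe` and `tokenRe` are hypothetical names, and the bracketed output format is invented for the example. The two patterns are the ones from the answer (the outer non-capturing group of the second is dropped).

```javascript
// Pattern 1 from the answer: one complete @(...) tag.
var tagRe = /@\((?:\s*(?:"[^"\\]*(?:\\.[^"\\]*)*"|[\w-]+(?:="?[^"\\]*(?:\\.[^"\\]*)*"?)?))+\s*\)/g;
// Pattern 2 from the answer: one token; group 1 is a quoted string without quotes.
var tokenRe = /"([^"\\]*(?:\\.[^"\\]*)*)"|[\w-]+(?:="?[^"\\]*(?:\\.[^"\\]*)*"?)?/g;

// Hypothetical renderer: turns the token list into a placeholder string.
function renderTag(tokens) {
  return '[' + tokens.join(' | ') + ']';
}

// Replaces every @(...) tag in the text with its rendered form.
function expandTags(text) {
  return text.replace(tagRe, function (tag) {
    var tokens = [];
    // Second pass: the nested replace is used only to visit each token.
    tag.replace(tokenRe, function (token, quoted) {
      tokens.push(quoted !== undefined ? quoted : token);
      return token;
    });
    return renderTag(tokens);
  });
}

console.log(expandTags('Text @("hello, world" bold italic font-size="15") end'));
// Text [hello, world | bold | italic | font-size="15"] end
```

The outer callback receives each whole tag, so the tokenizing step runs once per tag, just like the nested `exec` loops in the answer.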