I'm trying to use String.prototype.matchAll on the following string:

const text = 'textA [aaa](bbb) textB [ccc](ddd) textC'

I want to match the following:

  • 1st: "textA [aaa](bbb)"
  • 2nd: " textB [ccc](ddd)"
  • 3rd: " textC"

NOTE: The capturing groups are already present in the regex. That's what I need.

It's almost working, but so far I couldn't think of a way to match the last part of the string, which is just " textC", and doesn't have the [*](*) pattern.

What am I doing wrong?

const text = 'textA [aaa](bbb) textB [ccc](ddd) textC'
const regexp = /(.*?)\[(.+?)\]\((.+?)\)/g;

const array = Array.from(text.matchAll(regexp));
console.log(JSON.stringify(array[0][0]));
console.log(JSON.stringify(array[1][0]));
console.log(JSON.stringify(array[2][0])); // throws a TypeError: array[2] is undefined, there are only two matches

UPDATE:

Besides the good solutions provided in the answers below, this is also an option:

const text = 'textA [aaa](bbb) textB [ccc](ddd) textC'

const regexp = /(?!$)([^[]*)(?:\[(.*?)\]\((.*?)\))?/gm;

const array = Array.from(text.matchAll(regexp));

console.log(array);
cbdeveloper
  • try this : (\w+)\s*(?:\[(.+?)\]\((.+?)\))? – SEDaradji Jun 17 '19 at 18:48
  • Anything wrong with `(.+\)) (.+\)) (.+)`? – CAustin Jun 17 '19 at 18:51
  • [My answer](https://stackoverflow.com/a/56637630/3832970) will work to split any string with any pattern while keeping the matched text in the left-hand split chunk. Is it working for you? Are you sure the result you need is the one you showed in the question? Is `textC` a placeholder and it can just be equal to `word 1 word 2 and word 3 and so on....` and you need to get this text as a single item in the resulting array? – Wiktor Stribiżew Jun 18 '19 at 09:03

3 Answers

It's because there is no third match. After the first two matches, the only thing left in the string is " textC", which contains no [...](...) pattern for the regex to match:

https://regex101.com/r/H9Kn0G/1/

To fix this, make the whole second part optional. Also note the initial \w instead of ., which prevents the dot from eating the whole string, and the non-capturing ("grouping only") parentheses around the optional part, which keep your match groups the same:

(\w+)(?:\s\[(.+?)\]\((.+?)\))?

https://regex101.com/r/Smo1y1/2/
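
A quick way to check this pattern against the question's input, as a minimal sketch (the names `matches` and `last` are only illustrative). Because `\w+` cannot match a space, the leading spaces shown in the question's expected output are not part of these matches:

```javascript
const text = 'textA [aaa](bbb) textB [ccc](ddd) textC';
// Same pattern as above, with the g flag that matchAll requires.
const regexp = /(\w+)(?:\s\[(.+?)\]\((.+?)\))?/g;

// Keep only the full match text of each result.
const matches = Array.from(text.matchAll(regexp), m => m[0]);
console.log(matches);
// → ["textA [aaa](bbb)", "textB [ccc](ddd)", "textC"]

// In the last match the optional part did not participate,
// so groups 2 and 3 are undefined.
const last = Array.from(text.matchAll(regexp))[2];
console.log(last[1], last[2], last[3]); // → textC undefined undefined
```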

Scott Weaver
  • The word character is too restrictive for me. I want to match ANY string followed by the pattern `[+](+)`, and if multiple patterns `[+](+)` are written together one after the other I want to match them 1 by 1. – cbdeveloper Jun 18 '19 at 09:05
  • to match literally anything up to that next bracket, try this: `((?:(?!\[).)+)(?:\s?\[(.+?)\]\((.+?)\))?`. https://regex101.com/r/HqbTpU/1/ I added a 'tempered token' with that negative lookahead, more complex obviously. – Scott Weaver Jun 18 '19 at 09:17
  • @ScottWeaver Please never use a tempered greedy token when you restrict a `.` with a single char. `(?:(?!\[).)+` (almost) = `[^[]+`. It is equal to something like `[^[\n\r]+` in fact. The negated character class works much faster. – Wiktor Stribiżew Jun 18 '19 at 09:19
  • yes, that's a little simpler and works the same. https://regex101.com/r/vMFKXH/1/ – Scott Weaver Jun 18 '19 at 09:21
  • Besides, [your regex solution is easy to break](https://regex101.com/r/HqbTpU/3) if there are "standalone" brackets before `[...](...)` construction. – Wiktor Stribiżew Jun 18 '19 at 09:21
  • every regex solution is easy to break. the robustness required is up to OP - sometimes regex is better, sometimes algorithmic approaches are needed. – Scott Weaver Jun 18 '19 at 09:22
  • The exact solution that I need is `([^[]*)(?:\[(.+?)\]\((.+?)\))?` , but it matches the last position as a zero-length. If I change it to `([^[]+)(?:\[(.+?)\]\((.+?)\))?` requiring the first group to have at least 1 char, I get rid of the last position match, but I can't match multiple patterns of the second group when they're together like `[+](+)[+](+)[+](+)...` Any easy way to get rid of that last position zero-length? – cbdeveloper Jun 18 '19 at 09:29
  • I think I found what I need: `(?!$)([^[]*)(?:\[(.*?)\]\((.*?)\))?` – cbdeveloper Jun 18 '19 at 09:35
  • multiple [+](+)...N is a different ballgame though, right? because then it's an arbitrary length, which cannot be captured into match groups correctly. (you can capture the whole sequence then do another parse) – Scott Weaver Jun 18 '19 at 09:41

Solution 1: Splitting through matching

You may split by matching the pattern and getting substrings from the previous index up to the end of the match:

const text = 'textA [aaa](bbb) textB [ccc](ddd) textC'
const regexp = /\[[^\][]*\]\([^()]*\)/g;
let m, idx = 0, result=[];
while(m=regexp.exec(text)) {
  result.push(text.substring(idx, m.index + m[0].length).trim());
  idx = m.index + m[0].length;
}
if (idx < text.length) {
  result.push(text.substring(idx, text.length).trim())
}
console.log(result);

Note:

  • \[[^\][]*\]\([^()]*\) matches [, any 0+ chars other than [ and ] (with [^\][]*), then ](, then 0+ chars other than ( and ) (with [^()]*) and then a ) (see the regex demo)
  • The capturing groups are removed, but you may restore them and save in the resulting array separately (or in another array) if needed
  • .trim() is added to get rid of the leading/trailing whitespace (remove if not necessary).
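
The second note can be sketched as follows, assuming the groups are restored in the pattern and stored next to each chunk. The object shape and the names `chunk`, `label` and `target` are hypothetical, chosen only for illustration, and `.trim()` is left out here so the leading spaces survive:

```javascript
const text = 'textA [aaa](bbb) textB [ccc](ddd) textC';
// Same pattern as in Solution 1, with the two capturing groups restored.
const regexp = /\[([^\][]*)\]\(([^()]*)\)/g;

const result = [];
let idx = 0;
for (const m of text.matchAll(regexp)) {
  const end = m.index + m[0].length;
  // Chunk runs from the end of the previous match to the end of this one.
  result.push({ chunk: text.substring(idx, end), label: m[1], target: m[2] });
  idx = end;
}
// Trailing text after the last match has no groups.
if (idx < text.length) {
  result.push({ chunk: text.substring(idx), label: undefined, target: undefined });
}

console.log(result.map(r => r.chunk));
// → ["textA [aaa](bbb)", " textB [ccc](ddd)", " textC"]
```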

Solution 2: Matching optional pattern

The idea is to match any chars before the pattern you have and then match either your pattern or end of string:

let result = text.match(/(?!$)(.*?)(?:\[(.*?)\]\((.*?)\)|$)/g);

If the string can have line breaks, replace . with [\s\S], or consider this pattern:

let result = text.match(/(?!$)([\s\S]*?)(?:\[([^\][]*)\]\(([^()]*)\)|$)/g);

See the regex demo.

JS demo:

const text = 'textA [aaa](bbb) textB [ccc](ddd) textC'
const regexp = /(?!$)(.*?)(?:\[(.*?)\]\((.*?)\)|$)/g;

const array = Array.from(text.matchAll(regexp));
console.log(JSON.stringify(array[0][0]));
console.log(JSON.stringify(array[1][0]));
console.log(JSON.stringify(array[2][0]));

Regex details

  • (?!$) - not at the end of string
  • (.*?) - Group 1: any 0+ chars other than line break chars as few as possible (change to [\s\S]*? if there can be line breaks or add s modifier since you target ECMAScript 2018)
  • (?:\[(.*?)\]\((.*?)\)|$) - either of the two alternatives:
    • \[(.*?)\]\((.*?)\) - [, Group 2: any 0+ chars other than line break chars as few as possible, ](, Group 3: any 0+ chars other than line break chars as few as possible, and a )
    • | - or
    • $ - end of string.
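
To illustrate the remark about line breaks, here is a small sketch comparing the pattern with and without the s (dotAll) flag. The `multiline` input is made up for this example and is not taken from the question:

```javascript
// Hypothetical input containing line breaks.
const multiline = 'line1\n[aaa](bbb) tail\nend';

// Without the s flag, . cannot cross a line break, so the text before
// a line break is dropped from the matches.
const withoutS = /(?!$)(.*?)(?:\[(.*?)\]\((.*?)\)|$)/g;
// With the s flag, . also matches line break chars.
const withS = /(?!$)(.*?)(?:\[(.*?)\]\((.*?)\)|$)/gs;

const plainResult = multiline.match(withoutS);
const dotAllResult = multiline.match(withS);

console.log(plainResult);  // → ["[aaa](bbb)", "end"]
console.log(dotAllResult); // → ["line1\n[aaa](bbb)", " tail\nend"]
```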
Wiktor Stribiżew
  • Sorry I took too long to give feedback. Your answer seems to work (it's missing some spaces from `textB` and `textC`), but my main problem with it is that it didn't seem very readable to me. I would prefer to work with a `regex` and the `matchAll` method. Thank you. – cbdeveloper Jun 18 '19 at 09:07
  • I've got this `regex` which is basically working, but it's matching a zero-length match on the last position. `/([^\[]*)?(?:\[(.+?)\]\((.+?)\))?/gm` – cbdeveloper Jun 18 '19 at 09:08
  • @cbdev420 So, you want to use an unreadable regex solution? :) `/(?=[\s\S])([\s\S]*?)(?:\[([^\][]*)\]\(([^()]*)\)|$)/g` – Wiktor Stribiżew Jun 18 '19 at 09:09
  • I think that the regex I got now which is almost working is pretty readable. `([^\[]*)?(?:\[(.+?)\]\((.+?)\))?` Basically a group to match any character but the left bracket `[` if possible, then I try to match the pattern `[+](+)`. But I agree that readability is a point of view. It's really personal. – cbdeveloper Jun 18 '19 at 09:12
  • Can't I modify my current regex to get rid of the last match? Or would I need to go down a completely different path? Thanks a lot for your help. – cbdeveloper Jun 18 '19 at 09:13
  • @cbdev420 I added the fixed regex solution to the answer. With two variations – Wiktor Stribiżew Jun 18 '19 at 09:14
  • Thanks! I'm still testing some options but yours is working for sure. – cbdeveloper Jun 18 '19 at 09:30
  • @cbdev420 `(?=.)` and `(?!$)` are synonymous. See [this answer of mine](https://stackoverflow.com/a/50188154/3832970). – Wiktor Stribiżew Jun 18 '19 at 09:44

That is what I've ended up using:

const text = 'textA [aaa](bbb) textB [ccc](ddd) textC'

const regexp = /(?!$)([^[]*)(?:\[(.*?)\]\((.*?)\))?/gm;

const array = Array.from(text.matchAll(regexp));

console.log(array);
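
As a rough sketch of how the groups from this regex could be consumed, each match can be destructured into an object. The names `before`, `label` and `target` are only illustrative. The second call checks the case of several `[+](+)` patterns written back to back:

```javascript
const text = 'textA [aaa](bbb) textB [ccc](ddd) textC';
const regexp = /(?!$)([^[]*)(?:\[(.*?)\]\((.*?)\))?/gm;

// Destructure each match into the full match and its three groups.
const parts = Array.from(
  text.matchAll(regexp),
  ([full, before, label, target]) => ({ full, before, label, target })
);
console.log(parts.map(p => p.full));
// → ["textA [aaa](bbb)", " textB [ccc](ddd)", " textC"]
console.log(parts[2].label, parts[2].target); // → undefined undefined

// Adjacent patterns are matched one by one, with no trailing empty match.
const adjacent = Array.from('[a](b)[c](d)'.matchAll(regexp), m => m[0]);
console.log(adjacent); // → ["[a](b)", "[c](d)"]
```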
cbdeveloper