How to exclude occurrences after a positive lookbehind?

Question

Suppose I have the following markdown list items:

- [x] Example of a completed task.
- [x] ! Example of a completed task.
- [x] ? Example of a completed task.

I would like to parse these items with a regex and extract the following capture groups:

$1: the left [ and the right ] brackets when the symbol x is in-between
$2: the symbol x in between the brackets [ and ]
$3: the modifier ! that follows after [x]
$4: the modifier ? that follows after [x]
$5: the text that follows [x] without a modifier, e.g., [x] This is targeted.
$6: the text that follows [x] !
$7: the text that follows [x] ?

After a lot of trial-and-error using online parsers, I came up with the following:

((?<=x)\]|\[(?=x]))|((?<=\[)x(?=\]))|((?<=\[x\]\s)!(?=\s))|((?<=\[x\]\s)\?(?=\s))|((?<=\[x\]\s)[^!?].*)|((?<=\[x\]\s!\s).*)|((?<=\[x\]\s\?\s).*)

To make the regex above more readable, these are the capture groups listed one by one:

$1: ((?<=x)\]|\[(?=x]))
$2: ((?<=\[)x(?=\]))
$3: ((?<=\[x\]\s)!(?=\s))
$4: ((?<=\[x\]\s)\?(?=\s))
$5: ((?<=\[x\]\s)[^!?].*)
$6: ((?<=\[x\]\s!\s).*)
$7: ((?<=\[x\]\s\?\s).*)
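As a quick check of the group list above, here is a minimal sketch that runs the pattern over the three list items. It uses Python's built-in `re` module rather than Oniguruma, on the assumption that the two engines agree for this pattern because every lookbehind in it has a fixed width. The helper name `captures` is made up for illustration.

```python
import re

# The seven-group pattern from the question, copied unchanged.
# Assumption: Python's re behaves like Oniguruma here, because all
# lookbehinds in the pattern are fixed-width.
PATTERN = re.compile(
    r"((?<=x)\]|\[(?=x]))"
    r"|((?<=\[)x(?=\]))"
    r"|((?<=\[x\]\s)!(?=\s))"
    r"|((?<=\[x\]\s)\?(?=\s))"
    r"|((?<=\[x\]\s)[^!?].*)"
    r"|((?<=\[x\]\s!\s).*)"
    r"|((?<=\[x\]\s\?\s).*)"
)


def captures(line):
    """Return (group number, matched text) for every match in the line."""
    # Exactly one top-level group participates in each match, so
    # lastindex identifies which of $1..$7 produced it.
    return [(m.lastindex, m.group()) for m in PATTERN.finditer(line)]


for item in (
    "- [x] Example of a completed task.",
    "- [x] ! Example of a completed task.",
    "- [x] ? Example of a completed task.",
):
    print(item, "->", captures(item))
```

Each alternative is a separate match, so a single line yields several matches and each one fills exactly one of the seven groups.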

This is most likely not the best way to do it, but at least it seems to capture what I want.

I would like to extend that regex to capture lines in a markdown table that looks like this:

|       | Task name                               |    Plan     |   Actual    |      File      |
| :---- | :-------------------------------------- | :---------: | :---------: | :------------: |
| [x]   | Task one with a reasonably long name.   | 08:00-08:45 | 08:00-09:00 |  [[task-one]]  |
| [x] ! | Task two with a reasonably long name.   | 09:00-09:30 |             |  [[task-two]]  |
| [x] ? | Task three with a reasonably long name. | 11:00-13:00 |             | [[task-three]] |

More specifically, I am interested in having the same group captures as above, but I would like to exclude the table grid (i.e., the |). So, groups $1 to $4 should stay the same, but groups $5 to $7 should capture the text, excluding the |.

Do you have any ideas on how I can adjust, for example, the regex for group $5 to exclude the |? I have tried all sorts of negations (e.g., [^\|]) without success. I am using Oniguruma regular expressions.

For getting the column values you could do: https://regex101.com/r/kRIonQ/1 (regex: `((?<=\|)[^|]*)` ) — Luuk, Dec 03 '21 at 18:36
"but groups `$5` to `$7` should capture the text, excluding the `|`" - I don't believe it is possible for a capture group to be made up from a sequence of non-contiguous characters. It would be best to capture these as additional separate groups or post-process after capture. — Dean Taylor, Dec 03 '21 at 18:41
@DeanTaylor, I think you have very clearly formulated what I was trying to do, i.e., *to create a capture group from a sequence of non-contiguous characters*. Unfortunately, I cannot do any post-processing as the `regex` above is part of a grammar injection in VSCode. — Mihai, Dec 03 '21 at 20:13
@Luuk, this seems interesting. If I understand correctly, you were able to create a single group while excluding the `|`. I will play around with it. — Mihai, Dec 03 '21 at 20:15
@DeanTaylor, this combination `(?<=\|).*?(?=\|)` of positive lookbehind and positive lookahead seems to allow me to select everything in between adjacent pairs of `|` as a single group. Now I have to figure out how to exclude the pairs of `|` that contain `| [x] |`. — Mihai, Dec 03 '21 at 21:18
Try `((?<=x)\]|\[(?=x]))|((?<=\[)x(?=\]))|((?<=\[x\]\s)!(?=\s))|(?<=\[x\]\s)(\?)(?=\s)|(?<=x].*?\|)(.*?)(?=\|)` if you are using the regex engine used in open document Search and Replace feature. See https://regex101.com/r/XBFkp2/2 — Wiktor Stribiżew, Dec 03 '21 at 22:14
@WiktorStribiżew, I greatly appreciate that you took the time to look at this question! I focused on the `(?<=x].*?\|)(.*?)(?=\|)` part from the `regex` you wrote, and I see that the `*?` inside the positive lookbehind makes it non-fixed width. As per the documentation, the *pattern must have a fixed width*. About the engine, I am using [Oniguruma regular expressions](https://code.visualstudio.com/api/language-extensions/syntax-highlight-guide#textmate-grammars). — Mihai, Dec 04 '21 at 09:12
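
The two lookaround ideas from the comments above can be tried with a short sketch. It uses Python's built-in `re` module, which is not Oniguruma but applies a similar fixed-width rule to lookbehinds; treating it as a stand-in is an assumption. The sample row is taken from the table in the question.

```python
import re

ROW = "| [x]   | Task one with a reasonably long name.   | 08:00-08:45 | 08:00-09:00 |  [[task-one]]  |"

# Fixed-width lookbehind plus lookahead: the text between adjacent pipes.
between_pipes = re.compile(r"(?<=\|).*?(?=\|)")
cells = [c.strip() for c in between_pipes.findall(ROW)]
print(cells)

# A lazy quantifier inside the lookbehind makes its width variable.
# Python's re rejects such a pattern when compiling it.
try:
    re.compile(r"(?<=x].*?\|)(.*?)(?=\|)")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

The first pattern still returns the `[x]` cell, which is the part that remains to be excluded.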

nps · Answer 1 · 2021-12-05T13:04:28.317

2

Inspired by Wiktor's answer, try the following regex, which is quite short:

(?:\G(?<!\A)\||(?:\[x]\s[?!]?\s*\|?))\K([^|\n]*)

Explanation:

\G(?<!\A)\|

\G asserts the position at the end of the previous match, or at the start of the string for the first match. The negative lookbehind (?<!\A) rules out the start of the string (\A asserts the position at the start of the string), so this alternative applies only right after a previous match. \| then matches a literal | character.

(?:\[x]\s[?!]?\s*\|?)

A non-capturing group that matches [x], one whitespace character (\s), an optional ? or ! ([?!]?), zero or more whitespace characters (\s*), and an optional |.

\K

\K resets the starting point of the reported match, so everything matched before it is dropped from the overall match.

([^|\n]*)

A capture group that matches zero or more characters other than | and a newline (\n).

edited Dec 05 '21 at 13:04

answered Dec 03 '21 at 18:59

nps

Thanks for your suggestion! With the expression you indicated, the first `|` gets omitted, but subsequent ones are not. For example, in a construction of this form `| [x] | Task one with a reasonably long name. | 08:00-08:45 | 08:00-09:00 | [[task-one]] |`, I would like everything else but the `[x]` and `|` to be matched as a single group, while preserving the other groups I indicated in the question above. – Mihai Dec 03 '21 at 20:22
I am sorry, but this is not what I am looking for. I am interested to maintain all the seven group captures. With the `regex` you mentioned I get only one *incorrect* group capture, as shown here: https://regex101.com/r/NOuTOb/1. – Mihai Dec 04 '21 at 10:04
Maybe I'm getting your requirement wrong. One last clarification – nps Dec 04 '21 at 10:16
Maybe I am not explaining it clearly enough. I updated my question in the hope that the outcome will be more clear. Also, these are the groups and requirements: https://regex101.com/r/81Jc8a/1 (i.e., see *Description*). – Mihai Dec 04 '21 at 10:21
Updated answer, can be checked at https://regex101.com/r/5OaZc0/1 – nps Dec 05 '21 at 13:05
Yes, it gets the job done! I still find it hard to wrap my head around `\G(?!\A)\`. I accepted @Wiktor's answer as it was more timely, but I nevertheless appreciate your help! I noticed that when a capturing group is added around an expression that contains `\K`, the reset match is no longer respected. – Mihai Dec 05 '21 at 18:02
Thank you for the nice question and also @Wiktor. Got a good deal to learn – nps Dec 05 '21 at 19:20
I subscribe to that; I also have a lot to learn. I spent most of my weekend fiddling with your answer and @Wiktor's. I've learned about the `\K`, and now I am trying to see how to use it in conjunction with a capture group (e.g., https://stackoverflow.com/q/70237510/5252007). – Mihai Dec 05 '21 at 19:29

score 1 · Accepted Answer · answered Dec 04 '21 at 11:38

You can use

((?<=x)]|\[(?=x]))|((?<=\[)x(?=]))|((?<=\[x]\s)!(?=\s))|(?<=\[x]\s)(\?)(?=\s)|(?:\G(?!\A)\||(?<=\[x]\s[?!\s]\s\|))\K([^|\n]*)(?=\|)

See the regex101 PCRE demo and a Ruby (Onigmo/Oniguruma) demo.

What is added? The (?:\G(?!\A)\||(?<=\[x]\s[?!\s]\s\|))\K([^|\n]*)(?=\|) part:

(?: - start of a non-capturing group (a custom boundary here, we'll match...)
- \G(?!\A)\| - either the end of the previous match and a | char (i.e. | must immediately follow the previous match),
- |(?<=\[x]\s[?!\s]\s\|) - or a location that is immediately preceded with [x] + a whitespace + a ?, ! or whitespace + a whitespace and | char
) - end of the group
\K - match reset operator that removes the text matched so far from the overall match memory buffer
([^|\n]*) - zero or more chars other than | and a line feed char
(?=\|) - a | char must appear immediately to the right of the current location.
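
To make the chaining idea behind `\G` concrete, here is a minimal sketch in Python. Python's built-in `re` supports neither `\G` nor `\K`, so this is an emulation and not the pattern above: the anchored `Pattern.match(string, pos)` call plays the role of `\G`, and the explicit position bookkeeping plays the role of `\K`. The names `ENTRY`, `CELL` and `cells_after_checkbox` are made up for illustration.

```python
import re

# Entry point: "[x]", a whitespace, then ?, ! or whitespace, a whitespace, and a pipe.
# This mirrors the lookbehind alternative (?<=\[x]\s[?!\s]\s\|).
ENTRY = re.compile(r"\[x]\s[?!\s]\s\|")
# One cell: chars other than | and newline, with a | immediately to the right.
CELL = re.compile(r"([^|\n]*)(?=\|)")


def cells_after_checkbox(row):
    """Collect the cells that follow the checkbox cell of a table row."""
    entry = ENTRY.search(row)
    if entry is None:
        return []
    cells = []
    pos = entry.end()
    while True:
        # match() is anchored at pos, so each cell must start exactly
        # where the previous step ended (the job \G does in the regex).
        m = CELL.match(row, pos)
        if m is None:
            break
        cells.append(m.group(1))
        # Step over the pipe, like the \G(?!\A)\| alternative.
        pos = m.end() + 1
    return cells


row = "| [x] ! | Task two with a reasonably long name.   | 09:00-09:30 |             |  [[task-two]]  |"
print([c.strip() for c in cells_after_checkbox(row)])
```

The loop stops after the last pipe because no further `|` follows, which corresponds to the `(?=\|)` lookahead failing in the single-regex version.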

answered Dec 04 '21 at 11:38

Wiktor Stribiżew


I am thoroughly studying your solution and will report back! – Mihai Dec 04 '21 at 13:15
It took me a while to wrap my head around `\G`, `\A` and `\K`, but I can confirm that this works splendidly. The only thing is that when I try to create my capture groups, the `\K` seems to be ignored. For example, trying to match `b` and capture it as group `$1` (i.e., not `$0`) in string `ab` using `regex` `(a\Kb)` will match `ab`. – Mihai Dec 05 '21 at 17:55
@Mihai `\K` only affects the "**overall match** memory buffer", it does not affect captured texts. – Wiktor Stribiżew Dec 05 '21 at 21:56
I got really confused about it and I ended up asking a follow-up question here (i.e., https://stackoverflow.com/q/70237510/5252007). The idea is that I want to use your solution in a capture group and things become a bit hard to understand… You helped a lot and I can't stress enough how much I appreciate it. – Mihai Dec 05 '21 at 21:59
