Suppose I have the following markdown list items:

- [x] Example of a completed task.
- [x] ! Example of a completed task.
- [x] ? Example of a completed task.

I would like to parse these items with a regex and extract the following capture groups:

  • $1: the left [ and the right ] brackets when the symbol x is in-between
  • $2: the symbol x in between the brackets [ and ]
  • $3: the modifier ! that follows after [x]
  • $4: the modifier ? that follows after [x]
  • $5: the text that follows [x] without a modifier, e.g., [x] This is targeted.
  • $6: the text that follows [x] !
  • $7: the text that follows [x] ?

After a lot of trial-and-error using online parsers, I came up with the following:

((?<=x)\]|\[(?=x]))|((?<=\[)x(?=\]))|((?<=\[x\]\s)!(?=\s))|((?<=\[x\]\s)\?(?=\s))|((?<=\[x\]\s)[^!?].*)|((?<=\[x\]\s!\s).*)|((?<=\[x\]\s\?\s).*)

To make the regex above more readable, these are the capture groups listed one by one:

  • $1: ((?<=x)\]|\[(?=x]))
  • $2: ((?<=\[)x(?=\]))
  • $3: ((?<=\[x\]\s)!(?=\s))
  • $4: ((?<=\[x\]\s)\?(?=\s))
  • $5: ((?<=\[x\]\s)[^!?].*)
  • $6: ((?<=\[x\]\s!\s).*)
  • $7: ((?<=\[x\]\s\?\s).*)

This is most likely not the best way to do it, but at least it seems to capture what I want:

Matches for the example list items
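
As a rough check outside the online parsers, the seven alternatives can also be run against the example items in Ruby, whose Onigmo engine is a fork of Oniguruma. This is only a sketch: `ALTERNATIVES`, `TASK_RE` and the helper `captures_by_group` are names introduced here for illustration, every `]` is escaped (which should not change the meaning), and the engine behind VS Code grammars may differ from Ruby's in details.

```ruby
# One string per capture group, joined with | into the full pattern.
ALTERNATIVES = [
  '((?<=x)\]|\[(?=x\]))',    # $1: the brackets around x
  '((?<=\[)x(?=\]))',        # $2: the x between the brackets
  '((?<=\[x\]\s)!(?=\s))',   # $3: the ! modifier
  '((?<=\[x\]\s)\?(?=\s))',  # $4: the ? modifier
  '((?<=\[x\]\s)[^!?].*)',   # $5: text without a modifier
  '((?<=\[x\]\s!\s).*)',     # $6: text after [x] !
  '((?<=\[x\]\s\?\s).*)'     # $7: text after [x] ?
].freeze
TASK_RE = Regexp.new(ALTERNATIVES.join('|'))

# Hypothetical helper: collects every non-nil capture, keyed by group number.
def captures_by_group(line, regex)
  result = {}
  line.scan(regex) do |groups|
    groups.each_with_index do |text, index|
      (result[index + 1] ||= []) << text unless text.nil?
    end
  end
  result
end

[
  '- [x] Example of a completed task.',
  '- [x] ! Example of a completed task.',
  '- [x] ? Example of a completed task.'
].each { |line| p captures_by_group(line, TASK_RE) }
```

For the second item, for instance, this should report `[` and `]` under group 1, `x` under group 2, `!` under group 3, and the task text under group 6.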

I would like to extend that regex to capture lines in a markdown table that looks like this:

|       | Task name                               |    Plan     |   Actual    |      File      |
| :---- | :-------------------------------------- | :---------: | :---------: | :------------: |
| [x]   | Task one with a reasonably long name.   | 08:00-08:45 | 08:00-09:00 |  [[task-one]]  |
| [x] ! | Task two with a reasonably long name.   | 09:00-09:30 |             |  [[task-two]]  |
| [x] ? | Task three with a reasonably long name. | 11:00-13:00 |             | [[task-three]] |

More specifically, I am interested in having the same group captures as above, but I would like to exclude the table grid (i.e., the |). So, groups $1 to $4 should stay the same, but groups $5 to $7 should capture the text, excluding the |, e.g., like in the selection below:

Matches for the example table

Do you have any ideas on how I can adjust, for example, the regex for group $5 to exclude the |? I have tried all sorts of negations (e.g., [^\|]) without success. I am using Oniguruma regular expressions.

Mihai
  • For getting the column values you could do: https://regex101.com/r/kRIonQ/1 (regex: `((?<=\|)[^|]*)`) – Luuk Dec 03 '21 at 18:36
  • "but groups `$5` to `$7` should capture the text, excluding the `|`" - I don't believe it is possible for a capture group to be made up from a sequence of non-contiguous characters. It would be best to capture these as additional separate groups or post-process after capture. – Dean Taylor Dec 03 '21 at 18:41
  • @DeanTaylor, I think you have very clearly formulated what I was trying to do, i.e., *to create a capture group from a sequence of non-contiguous characters*. Unfortunately, I cannot do any post-processing as the `regex` above is part of a grammar injection in VSCode. – Mihai Dec 03 '21 at 20:13
  • @Luuk, this seems interesting. If I understand correctly, you were able to create a single group while excluding the `|`. I will play around with it. – Mihai Dec 03 '21 at 20:15
  • @DeanTaylor, this combination `(?<=\|).*?(?=\|)` of positive lookbehind and positive lookahead seems to allow me to select everything in between adjacent pairs of `|` as a single group. Now I have to figure out how to exclude the pairs of `|` that contain `| [x] |`. – Mihai Dec 03 '21 at 21:18
  • Try `((?<=x)\]|\[(?=x]))|((?<=\[)x(?=\]))|((?<=\[x\]\s)!(?=\s))|(?<=\[x\]\s)(\?)(?=\s)|(?<=x].*?\|)(.*?)(?=\|)` if you are using the regex engine used in open document Search and Replace feature. See https://regex101.com/r/XBFkp2/2 – Wiktor Stribiżew Dec 03 '21 at 22:14
  • @WiktorStribiżew, I greatly appreciate that you took the time to look at this question! I focused on the `(?<=x].*?\|)(.*?)(?=\|)` part from the `regex` you wrote, and I see that the `*?` inside the positive lookbehind makes it non-fixed width. As per the documentation, the *pattern must have a fixed width*. About the engine, I am using [Oniguruma regular expressions](https://code.visualstudio.com/api/language-extensions/syntax-highlight-guide#textmate-grammars). – Mihai Dec 04 '21 at 09:12
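
The lookbehind/lookahead pair discussed in the comments above can be tried in Ruby as a quick sketch. The row is shortened from the question's table, and `ROW`, `BETWEEN_PIPES` and `cells` are illustrative names only. It shows the remaining problem mentioned in the comments: the cell holding the `[x]` marker is returned like any other cell.

```ruby
# A shortened table row, used only for illustration.
ROW = '| [x] ! | Task two | 09:00-09:30 | [[task-two]] |'

# Everything between two adjacent | characters, matched lazily.
BETWEEN_PIPES = /(?<=\|).*?(?=\|)/

# scan returns each match as a string because the pattern has no groups;
# strip removes the padding spaces inside each cell.
cells = ROW.scan(BETWEEN_PIPES).map(&:strip)
p cells
```

The first element of `cells` is the marker cell itself, so this pattern alone cannot separate the task marker from the text columns.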

2 Answers

Inspired by Wiktor's answer, here is a regex that is quite short:

(?:\G(?<!\A)\||(?:\[x]\s[?!]?\s*\|?))\K([^|\n]*)

Explanation of the regex above:

  1. \G(?<!\A)\|
     • \G asserts the position at the end of the previous match, or at the start of the string for the first match.
     • (?<!\A) is a negative lookbehind. \A asserts the position at the start of the string, so the lookbehind rules out the start-of-string case of \G.
     • \| matches the character |.
  2. (?:\[x]\s[?!]?\s*\|?)
     A non-capturing group that matches [x], one whitespace character (\s), an optional ? or ! ([?!]?), zero or more whitespace characters (\s*), and an optional | (\|?).
  3. \K
     Resets the starting point of the reported match.
  4. ([^|\n]*)
     Captures zero or more characters other than | and \n (newline).

nps
  • Thanks for your suggestion! With the expression you indicated, the first `|` gets omitted, but subsequent ones are not. For example, in a construction of this form `| [x] | Task one with a reasonably long name. | 08:00-08:45 | 08:00-09:00 | [[task-one]] |`, I would like everything else but the `[x]` and `|` to be matched as a single group, while preserving the other groups I indicated in the question above. – Mihai Dec 03 '21 at 20:22
  • I am sorry, but this is not what I am looking for. I would like to keep all seven capture groups. With the `regex` you mentioned I get only one *incorrect* group capture, as shown here: https://regex101.com/r/NOuTOb/1. – Mihai Dec 04 '21 at 10:04
  • Maybe I'm getting your requirement wrong. One last clarification – nps Dec 04 '21 at 10:16
  • Maybe I am not explaining it clearly enough. I updated my question in the hope that the outcome will be more clear. Also, these are the groups and requirements: https://regex101.com/r/81Jc8a/1 (i.e., see *Description*). – Mihai Dec 04 '21 at 10:21
  • Updated answer, can be checked at https://regex101.com/r/5OaZc0/1 – nps Dec 05 '21 at 13:05
  • Yes, it gets the job done! I still find it hard to wrap my head around `\G(?!\A)\`. I accepted @Wiktor's answer as it was more timely, but I nevertheless appreciate your help! I noticed that when a capturing group is added around an expression that contains `\K`, the reset match is no longer respected. – Mihai Dec 05 '21 at 18:02
  • Thank you for the nice question and also @Wiktor. Got a good deal to learn – nps Dec 05 '21 at 19:20
  • I subscribe to that---I also have a lot to learn. I spent most of my weekend fiddling with your answer and @Wiktor's. I've learned about the `\K`, and now I am trying to see how to use it in conjunction with a capture group (e.g., https://stackoverflow.com/q/70237510/5252007). – Mihai Dec 05 '21 at 19:29

You can use

((?<=x)]|\[(?=x]))|((?<=\[)x(?=]))|((?<=\[x]\s)!(?=\s))|(?<=\[x]\s)(\?)(?=\s)|(?:\G(?!\A)\||(?<=\[x]\s[?!\s]\s\|))\K([^|\n]*)(?=\|)

See the regex101 (PCRE) and Ruby (Onigmo/Oniguruma) demos.

What is added? The (?:\G(?!\A)\||(?<=\[x]\s[?!\s]\s\|))\K([^|\n]*)(?=\|) part:

  • (?: - start of a non-capturing group (a custom boundary here, we'll match...)
    • \G(?!\A)\| - either the end of the previous match and a | char (i.e. | must immediately follow the previous match),
    • |(?<=\[x]\s[?!\s]\s\|) - or a location that is immediately preceded with [x] + a whitespace + a ?, ! or whitespace + a whitespace and | char
  • ) - end of the group
  • \K - match reset operator that removes the text matched so far from the overall match memory buffer
  • ([^|\n]*) - zero or more chars other than | and a line feed char
  • (?=\|) - a | char must appear immediately to the right of the current location.
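
A minimal Ruby sketch of how the pattern behaves on one table row, assuming Ruby's Onigmo engine is close enough to the Oniguruma build in question. The row is shortened from the question's table, every `]` is escaped (which should not change the meaning), and `TABLE_RE`, `row`, `matches`, `brackets`, `modifier` and `cells` are names chosen for illustration only.

```ruby
# The pattern from this answer, one alternative per line.
TABLE_RE = Regexp.new([
  '((?<=x)\]|\[(?=x\]))',                                   # $1: brackets
  '((?<=\[)x(?=\]))',                                       # $2: the x
  '((?<=\[x\]\s)!(?=\s))',                                  # $3: ! modifier
  '(?<=\[x\]\s)(\?)(?=\s)',                                 # $4: ? modifier
  '(?:\G(?!\A)\||(?<=\[x\]\s[?!\s]\s\|))\K([^|\n]*)(?=\|)'  # $5: cell text
].join('|'))

row = '| [x] ! | Task two | 09:00-09:30 |             | [[task-two]] |'

# With groups in the pattern, scan yields one array of five entries per
# match; only the group that took part in the match is non-nil.
matches = row.scan(TABLE_RE)

brackets = matches.map { |groups| groups[0] }.compact           # $1
modifier = matches.map { |groups| groups[2] }.compact           # $3
cells = matches.map { |groups| groups[4] }.compact.map(&:strip) # $5

p brackets
p modifier
p cells
```

Each cell after the marker column arrives as its own `$5` match, because `\G` lets the next match start at the `|` where the previous one stopped. The empty "Actual" cell shows up as an empty string after `strip`.
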
Wiktor Stribiżew
  • I am thoroughly studying your solution and will report back! – Mihai Dec 04 '21 at 13:15
  • It took me a while to wrap my head around `\G`, `\A` and `\K`, but I can confirm that this works splendidly. The only thing is that when I try to create my capture groups, the `\K` seems to be ignored. For example, trying to match `b` and capture it as group `$1` (i.e., not `$0`) in string `ab` using `regex` `(a\Kb)` will match `ab`. – Mihai Dec 05 '21 at 17:55
  • @Mihai `\K` only affects the "**overall match** memory buffer", it does not affect captured texts. – Wiktor Stribiżew Dec 05 '21 at 21:56
  • I got really confused about it and I ended up asking a follow-up question here (i.e., https://stackoverflow.com/q/70237510/5252007). The idea is that I want to use your solution in a capture group, and things become a bit hard to understand… You helped a lot and I cannot stress enough how much I appreciate it. – Mihai Dec 05 '21 at 21:59
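
The `\K` behaviour discussed in these comments can be reproduced with a short Ruby sketch. It assumes that Ruby's Onigmo engine treats `\K` the same way as the Oniguruma build used elsewhere, and `inside` and `after` are illustrative names.

```ruby
# \K inside the capture group: only the overall match is reset,
# the group still starts where its opening parenthesis was entered.
inside = 'ab'.match(/(a\Kb)/)
p inside[0] # overall match ($0)
p inside[1] # group 1 ($1)

# Moving the group to the right of \K keeps the unwanted text out of it.
after = 'ab'.match(/a\K(b)/)
p after[0]
p after[1]
```

In the first case the overall match should be `b` while group 1 is still `ab`. In the second, both should be `b`, which suggests placing the capture group after `\K` instead of around it.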