I found this question about using capture groups with \K (the "reset match start" escape, if that is the correct name), but it does not answer my query.

Suppose I have the following string:

ab

With the regex a\Kb, the output is, as expected, b.

However, when I add a capture group using the regex (a\Kb), group $1 returns ab and not b.

Given the following string:

ab
cd

Using the regex (a\Kb)|(c\Kd), I would hope for group $1 to contain b and group $2 to contain d, but that is not the case.

I tried Wiktor Stribiżew's answer that points to using a branch reset group:

(?|a\Kb)|(?|c\Kd)

With this pattern, however, both matches end up in group $0, whereas I require them to be in groups $1 and $2, respectively. Do you have any ideas on how this can be achieved? I am using Oniguruma regular expressions and the PCRE flavor.


Update based on the comments below.

The example above was meant to be easy to understand and reproduce. @Booboo pointed out that wrapping each alternative in a non-capturing group, with the capture group placed after \K, does the trick:

(?:a\K(b))|(?:c\K(d))

This places b in group $1 and d in group $2.
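A minimal sketch of why the group numbering works out, using Python's standard re module purely for illustration (an assumption; the question targets Oniguruma/PCRE). In PCRE, \K only moves the reported start of the overall match ($0); a capture group still records what its own parentheses matched. Python's re has no \K, so the pattern below omits it, which changes $0 but not $1 or $2.

```python
import re

# Same shape as (?:a\K(b))|(?:c\K(d)), minus \K (unsupported by Python's re).
PATTERN = re.compile(r"(?:a(b))|(?:c(d))")


def groups_per_match(text):
    """Return (group 0, group 1, group 2) for every match in text."""
    return [(m.group(0), m.group(1), m.group(2)) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    for row in groups_per_match("ab\ncd"):
        print(row)
```

Without \K, group 0 still includes the a or c prefix; only the inner groups carry the b and d that the question asks for.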

However, when applied to another example it fails. Therefore, for clarity, I am extending this question to cover the more complicated scenario discussed in the comments.

Suppose I have the following text in a markdown file:

- [x] Example task. | Task ends. [x] Another task.
- [x] ! Example task. | This ends. [x] ! Another task.

This is a sentence. [x] Task is here.
Other text. Another [x] ! Task is here.

|       | Task name     |    Plan     |   Actual    |      File      |
| :---- | :-------------| :---------: | :---------: | :------------: |
| [x]   | Task example. | 08:00-08:45 | 08:00-09:00 |  [[task-one]]  |
| [x] ! | Task example. | 08:00-08:45 | 08:00-09:00 |  [[task-one]]  |

I am interested in a single regular expression with two capture groups as follows:

  • group $1 (i.e., see selection below):

    • outside the table: capture everything after [x] (i.e., not followed by !) until a |

    • inside the table: capture everything after [x] (i.e., not followed by !) excluding the | symbols

      [Image: matches for the first capture group]

  • group $2 (i.e., see selection below):

    • outside the table: capture everything after [x] ! until a |

    • inside the table: capture everything after [x] ! excluding the | symbols

      [Image: matches for the second capture group]
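To make the two bullet lists above concrete, here is a rough procedural sketch of one possible reading of the requirements. It is not a single regex and is not usable in a TextMate grammar; it is only meant as a reference for checking candidate patterns. The helper name classify and the row-splitting rules are assumptions made for illustration.

```python
import re

# Outside the table: "[x]", an optional "!", then text up to the next "|".
TASK = re.compile(r"\[x\][ \t]*(!?)[ \t]*([^|\n]*)")


def classify(text):
    """Return (group1, group2) fragments under the assumed reading above."""
    group1, group2 = [], []
    for line in text.splitlines():
        if line.lstrip().startswith("|"):
            # Table row: the first cell decides which group the other cells join.
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
            marker = cells[0].replace(" ", "")
            if marker == "[x]":
                group1.extend(cells[1:])
            elif marker == "[x]!":
                group2.extend(cells[1:])
        else:
            for m in TASK.finditer(line):
                target = group2 if m.group(1) else group1
                target.append(m.group(2).strip())
    return group1, group2
```

Header and separator rows are ignored because their first cell is neither [x] nor [x] !.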

I have the following regexes (see demo here) that work when evaluated individually, but not when combined inside capture groups:

  • group $1:
    • outside the table: [^\|\s]\s*\[x\]\s*\K[^!|\n]*
    • inside the table: (?:\G(?!\A)\||(?<=\[x]\s)\s*\|)\K[^|\n]*(?=\|)
  • group $2:
    • outside the table: [^\|\s]\s*\[x\]\s*\!\s*\K[^|\n]*
    • inside the table: (?:\G(?!\A)\||(?<=\[x]\s)\s*\!\s*\|)\K[^|\n]*(?=\|)

The problem I am experiencing is when combining the expressions above.

Pseudo regex:

([x] outside|[x] inside)|([x] ! outside|[x] ! inside)

Actual regex:

([^\|\s]\s*\[x\]\s*\K[^!|\n]*|(?:\G(?!\A)\||(?<=\[x]\s)\s*\|)\K[^|\n]*(?=\|))|([^\|\s]\s*\[x\]\s*\!\s*\K[^|\n]*|(?:\G(?!\A)\||(?<=\[x]\s)\s*\!\s*\|)\K[^|\n]*(?=\|))

Its output is shown in the demo linked above.

The regex for the matches inside the table is based on Wiktor Stribiżew's answer and explained here.

Mihai
  • Why not just use a *lookbehind assertion* instead: `((?<=a)b)`. See [regex demo](https://regex101.com/r/Eq64HM/1/) – Booboo Dec 05 '21 at 19:06
  • @Booboo, my actual pattern is more complicated than the MRE in the question (e.g., https://regex101.com/r/mQl58d/1) and I am not sure I can do it with a *lookbehind*... – Mihai Dec 05 '21 at 19:23
  • @Booboo, I think the reason why I cannot use a *positive lookbehind* assertion is because of the fixed-width restriction. – Mihai Dec 05 '21 at 19:32
  • Anyway, perhaps Mr. Stribiżew will see this and chime in. You should also tag your question with the language you are using. Interesting that it supports \K but not varying length lookbehinds. – Booboo Dec 05 '21 at 19:32
  • And you would not be happy with `(?:a\K(b))`? See [regex demo](https://regex101.com/r/orLW3F/1/) – Booboo Dec 05 '21 at 19:37
  • @Booboo, thanks, I'll go ahead and tag it. I am using the `PCRE2` flavour and the `regex` in question is for a [grammar language injection in VSCode](https://code.visualstudio.com/api/language-extensions/syntax-highlight-guide#textmate-grammars). – Mihai Dec 05 '21 at 19:37
  • @Booboo, ha! The `(?:a\K(b))|(?:c\K(d))` seems to do the trick (e.g., [`regex` demo](https://regex101.com/r/TnZkru/1)). But why? – Mihai Dec 05 '21 at 19:44
  • (?: blah-blah) is a *non-capturing* group. So there is only a single *capturing* group, namely capture group 1, in my regex. – Booboo Dec 05 '21 at 19:46
  • @Booboo, that makes sense. Yet, I just tried it for the more complicated example and it fails (e.g., [`regex` demo](https://regex101.com/r/0OcU2q/1)). Sorry for the back and forth, but appreciate your input. – Mihai Dec 05 '21 at 19:52
  • Not sure what you are looking for, maybe something like that: https://regex101.com/r/L6C6UL/1 – Casimir et Hippolyte Dec 05 '21 at 20:28
  • @CasimiretHippolyte, I apologize for the confusion. I wanted to only provide an MRE but that backfired... I updated the question to reflect what I really am after. – Mihai Dec 05 '21 at 21:53
  • If you are sure there's always a horizontal space at the end of each cell, you could do this: https://regex101.com/r/1o4OyE/2 – Casimir et Hippolyte Dec 05 '21 at 22:51
  • @CasimiretHippolyte, I carefully studied your solution and I think it fits the bill. I want to test a few more cases and I will report back! – Mihai Dec 06 '21 at 18:37
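The lookbehind idea and the fixed-width restriction raised in the comments can be sketched with Python's standard re module. This is an illustration only: Python's engine is an assumption here, and its restriction is similar to, but not necessarily identical to, the one in the engine used in the question. The helper names are hypothetical.

```python
import re

# Fixed-width lookbehinds keep the prefix out of both $0 and the groups.
FIXED = re.compile(r"((?<=a)b)|((?<=c)d)")


def lookbehind_groups(text):
    """Return (group 1, group 2) for every match in text."""
    return [(m.group(1), m.group(2)) for m in FIXED.finditer(text)]


def compiles(pattern):
    """Return True if Python's re accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    print(lookbehind_groups("ab\ncd"))
    # A variable-width prefix such as \s* inside a lookbehind is rejected.
    print(compiles(r"(?<=\[x\]\s*)task"))
```

This is why the simple a/b example works with a lookbehind while a prefix containing \s* does not.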

4 Answers

Instead of \K, try to use control verbs (*SKIP)(*F):

(a(*SKIP)(*F)|b)|(c(*SKIP)(*F)|d)

Check the test case.

Hao Wu
  • This is very interesting! I will experiment around with these verbs in the more complicated example and report back! – Mihai Dec 06 '21 at 18:43

You can use

(?|(?:\G(?!\A)(?<=\|)|^\|\h*\[x\]\h*\|)\h*\K([^|\n]+)(?<=\S)\h*\||\[x]\h*\K([^|\s!]+(?:\h*[^|\s]+)*))|(?|(?:\G(?!\A)\||^\|\h*\[x]\h*!\h*\|)\h*\K([^|\n]+)(?<=\S)\h*|\[x]\h*!\h*\K([^|\s]+(?:\h*[^|\s]+)*))

See the regex demo. Details:

  • (?|(?:\G(?!\A)(?<=\|)|^\|\h*\[x\]\h*\|)\h*\K([^|\n]+)(?<=\S)\h*\||\[x]\h*\K([^|\s!]+(?:\h*[^|\s]+)*)) - a branch reset group matching:

    • (?:\G(?!\A)(?<=\|)|^\|\h*\[x\]\h*\|) - a non-capturing group matching either
      • \G(?!\A)(?<=\|) - the end of the previous successful match that is immediately preceded with a | char
    • | - or
      • ^\|\h*\[x\]\h*\| - start of a line/string, |, zero or more horizontal whitespaces, [x], zero or more horizontal whitespaces, |
    • \h*\K - zero or more horizontal whitespaces that are immediately discarded from the match value after matching
    • ([^|\n]+)(?<=\S) - Group 1: one or more chars other than a LF and |, as many as possible, but the chunk must end with a non-whitespace char
    • \h*\| - zero or more horizontal whitespaces and a | char
  • | - or

    • \[x]\h*\K - [x], zero or more horizontal whitespaces, and this text is discarded from the match value
    • ([^|\s!]+(?:\h*[^|\s]+)*) - Group 1 (mind it is a branch reset group): one or more chars other than !, | and whitespace, and then zero or more occurrences of zero or more horizontal whitespaces and then one or more chars other than | and whitespace
  • | - or

  • (?|(?:\G(?!\A)\||^\|\h*\[x]\h*!\h*\|)\h*\K([^|\n]+)(?<=\S)\h*|\[x]\h*!\h*\K([^|\s]+(?:\h*[^|\s]+)*)) - a branch reset group:

    • (?:\G(?!\A)\||^\|\h*\[x]\h*!\h*\|) - end of the previous successful match and a | char after, or start of string, |, zero or more horizontal whitespaces, [x], ! enclosed with zero or more horizontal whitespaces, a | char
    • \h*\K - zero or more horizontal whitespaces and the whole text matched so far is discarded from the match value
    • ([^|\n]+)(?<=\S) - Group 2: any one or more chars other than LF and | chars that end with a non-whitespace char
    • \h* - zero or more horizontal whitespaces
  • | - or

    • \[x] - a [x] string
    • \h*!\h*\K - ! enclosed with zero or more horizontal whitespaces and the whole text matched so far is discarded from the match value
    • ([^|\s]+(?:\h*[^|\s]+)*) - Group 2 (mind it is a branch reset group): one or more chars other than | and whitespace, and then zero or more occurrences of zero or more horizontal whitespaces and then one or more chars other than | and whitespace.
Wiktor Stribiżew
  • That's exactly what I was trying to accomplish! Thanks a lot for the detailed explanation. I've learned a lot from you these past few days. – Mihai Dec 18 '21 at 13:46

If I understand what you are trying to match, use as a regex:

(?:[^|\s]\s*\[x\](?!\s*!)\s*\K([^!|\n]*))|(?:[^|\s]\s*\[x\]\s*!\s*\K([^|\n]*))

See Regex Demo

I removed some unnecessary escaping. More importantly, for Group 1 matches (the first alternative), note that right after matching [x] I added the following negative lookahead assertion:

(?!\s*!)

This ensures that the [x] is not followed by zero or more whitespace characters and then an exclamation mark. Only then do you want to match everything up to the next exclamation mark, |, or newline as Group 1.
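A quick way to see which group each alternative fills is to run the pattern with \K removed in Python's standard re module. This is only a sketch under that assumption: Python's re does not support \K, so group 0 differs from the PCRE result, but groups 1 and 2 are filled the same way. The sample text is the first two non-table lines from the question.

```python
import re

# The pattern above with \K removed (Python's re does not support it).
PATTERN = re.compile(
    r"(?:[^|\s]\s*\[x\](?!\s*!)\s*([^!|\n]*))"  # group 1: [x] not followed by !
    r"|(?:[^|\s]\s*\[x\]\s*!\s*([^|\n]*))"      # group 2: [x] followed by !
)

SAMPLE = (
    "- [x] Example task. | Task ends. [x] Another task.\n"
    "- [x] ! Example task. | This ends. [x] ! Another task.\n"
)


def collect(text):
    """Split the captured fragments into group 1 and group 2 lists."""
    group1, group2 = [], []
    for m in PATTERN.finditer(text):
        if m.group(1) is not None:
            group1.append(m.group(1).strip())
        else:
            group2.append(m.group(2).strip())
    return group1, group2


if __name__ == "__main__":
    print(collect(SAMPLE))
```

The strip() calls only trim the whitespace that the character classes pick up before a |; they are not part of the pattern itself.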

Booboo
  • As I also said above, I apologize for the confusion. I wanted to only provide an MRE but that may have not been a wise choice. I updated my question to be as clear as possible about the intended outcome. – Mihai Dec 05 '21 at 21:54

Taking the example you've provided on regex101, the following expression can be tried, with one caveat: the text should not contain any square bracket other than those in "[x]".

(?<!\|\s)(((?:\[x]\s[!]?))\K[^[\n]+)

Explaining the above

  1. (?<!\|\s)
  • This negative lookbehind discards the table rows, as you mentioned
  2. (?:\[x]\s[!]?)
  • This is a non-capturing group that matches "[x] " or "[x] !"
  3. \K (optional)
  • \K resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match
  4. [^[\n]+
  • A negated character class that matches any character other than [ or a newline, between one and unlimited times

Regex101 sample

nps
  • Please try out and tell. Do check on the mentioned caveat – nps Dec 06 '21 at 15:49
  • Thanks again for your help! I tried it (e.g., [demo here](https://regex101.com/r/UvZudU/1)), but it doesn't match the requirements. I added a more detailed description in the demo linked above about what each group should capture. The closest I got is in [this demo here](https://regex101.com/r/4dJVlz/6). – Mihai Dec 06 '21 at 18:51
  • Added [ to the list of tokens at which to stop matching. Side effect is the text in the table with [[task name]] https://regex101.com/r/QUWUEK/1. – nps Dec 06 '21 at 20:23