Given the following simplified example text:

not me G(select me, and me)
G(select me) G(also me)

Using regular expressions, I would like to select everything inside each G(...) as separate results, splitting the contents wherever there is a comma. Based on different answers here on SO, this was my first attempt:

G\(([^)]+)\)

This works for the second line, but on the first line it returns `select me, and me` as a single match instead of two. My second attempt, based on other answers for selecting values from a comma-separated list:

G\(([^),]+)

Another attempt based on this SO, and another based on this SO.

Basically, I need help...

Expected output:

select me
and me
select me
also me
tstev
  • It doesn't: it matched the entire contents between `G(` and `)`. I want the values separated by the comma. Sorry if that wasn't clear from the question. – tstev Sep 13 '19 at 09:58
  • Please include the expected output in your question so we can help you get it from that input. – Ed Morton Sep 14 '19 at 00:57
  • I am pretty sure given the example text `select me` and the first sentence after that it was clear. But fair comment I can add it to make it even more obvious. – tstev Sep 15 '19 at 07:37
  • No, you could have wanted it all on one line, or the output segments on the same 3 lines as they appeared in the input, or everything on one line or you could have wanted unique outputs instead of all outputs or something else, and you could have wanted your output comma separated or something else. It's always best/required to show your expected output to remove all ambiguity. – Ed Morton Sep 15 '19 at 12:52
  • Ah yes, fair enough indeed. I added the expected format of the result :) thanks – tstev Sep 16 '19 at 08:03

2 Answers

Here is a way to do this in GNU awk. It is more verbose, but it uses a fairly simple regex and doesn't depend on the experimental PCRE option (-P) of GNU grep:

s="G(also me1) not me G(select me, and me) G(select me) G(also me)"
awk '{ 
   while ( match($0, /\<G\(([^)]*)\)(.*)/, a) ) {
      gsub(/ *, */, "\n", a[1])
      print a[1]
      $0 = a[2]
   }
}' <<< "$s"

also me1
select me
and me
select me
also me

Based on oguz ismail's comment below: POSIX/BSD awk has neither the word-boundary operator \< nor the three-argument form of match(), so for a portable version use this awk command instead. It requires G( to be at the start of the line or preceded by a blank:

awk '{
   while ( match($0, /(^|[[:blank:]])G\([^)]*\)/) ) {
      m=substr($0, RSTART+2, RLENGTH-3)
      sub(/^\(/, "", m)
      gsub(/ *, */, "\n", m)
      print m
      $0=substr($0, RSTART+RLENGTH)
   }
}' <<< "$s"
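
The offsets RSTART+2 and RLENGTH-3 in the command above are easier to follow with a small demonstration. The following is a minimal sketch, not part of the original answer; the helper name `show_offsets` and the two sample lines are made up for illustration. It prints the raw substring before `sub()` runs: when the match starts at the beginning of the line the substring is already clean, and when it starts with a blank a leading `(` is left over, which is what `sub(/^\(/, "", m)` removes.

```shell
# Hypothetical helper: print the match offsets and the raw substring taken
# by substr($0, RSTART+2, RLENGTH-3), before sub() strips any leading "(".
show_offsets() {
   awk '{
      if (match($0, /(^|[[:blank:]])G\([^)]*\)/)) {
         raw = substr($0, RSTART+2, RLENGTH-3)
         printf "RSTART=%d RLENGTH=%d raw=[%s]\n", RSTART, RLENGTH, raw
      }
   }'
}

printf '%s\n' 'G(a, b)' 'x G(a, b)' | show_offsets
# RSTART=1 RLENGTH=7 raw=[a, b]
# RSTART=2 RLENGTH=8 raw=[(a, b]
```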
anubhava
  • This is a GNU aided solution as well, POSIX EREs don't support `\<`. – oguz ismail Sep 13 '19 at 11:02
  • I am well aware of that, and the answer doesn't claim anything about POSIX ERE since OP has used the tag `linux`. – anubhava Sep 13 '19 at 11:04
  • My point is, this won't work with all awks. Not all awks are GNU awk – oguz ismail Sep 13 '19 at 11:16
  • To make it work on POSIX/BSD awk it would be: `awk '{ while ( match($0, /(^|[^_[:alnum:]])G\([^)]*\)/) ) { m=substr($0, RSTART+2, RLENGTH-3); sub(/^\(/, "", m); gsub(/ *, */, "\n", m); print m; $0=substr($0, RSTART+RLENGTH) } }' <<< "$s"` – anubhava Sep 13 '19 at 11:21
  • Thanks @EdMorton. That is very handy and surely shortened the `gnu-awk` solution. I keep forgetting the 3rd parameter of `match` in `gnu-awk`, thanks so much! – anubhava Sep 16 '19 at 08:45

With GNU grep, you may use

(?:\G(?!^),\s*|\bG\()\K[^(),]+(?=[^()]*\))

See the regex demo.

Details

  • (?:\G(?!^),\s*|\bG\() - either the end of the previous match followed by a , and 0+ whitespace chars, or G( with no letter, digit or _ right before it
  • \K - omits the text matched so far
  • [^(),]+ - 1+ chars other than (, ) and ,
  • (?=[^()]*\)) - there must be 0+ chars other than ( and ) and then a ) immediately to the right of the current location.

See online demo:

rx='(?:\G(?!^),\s*|\bG\()\K[^(),]+(?=[^()]*\))'
example="not me G(select me, and me) G(select me) G(also me)"
grep -oP "$rx" <<< "$example"
# Also works with pcregrep: 
# pcregrep -o  "$rx" <<< "$example"

Output:

select me
and me
select me
also me
Wiktor Stribiżew
  • @tstev Yes, I also tested with `pcregrep -o '(?:\G(?!^),\s*|\bG\()\K[^(),]+(?=[^()]*\))' file`, it works well, too. – Wiktor Stribiżew Sep 13 '19 at 10:03
  • would really like to see if this can be achieved without `\K`. Mine is failing when there are more than one commas – CinCout Sep 13 '19 at 10:04
  • How do you build such a complicated regex so fast? – tstev Sep 13 '19 at 10:04
  • @tstev I am afraid I will have to say it is not complicated for me. – Wiktor Stribiżew Sep 13 '19 at 10:08
  • @CinCout It is not possible to write this without `\K` - unless you add more piped commands to post-process the results. – Wiktor Stribiżew Sep 13 '19 at 10:09
  • fair enough ;) i was just wondering if there are tools you use – tstev Sep 13 '19 at 10:09
  • @tstev Yes, https://regex101.com is the tool that simplifies writing PCRE, JS, Python and Go regexps. You just need to know how to use them later in the target environment, or port to the environment of your choice. – Wiktor Stribiżew Sep 13 '19 at 10:10
  • @WiktorStribiżew See this https://regex101.com/r/zRZtdO/5 The second entry with more than one `,` missed to capture the central string. – CinCout Sep 13 '19 at 10:10
  • @CinCout That is again a [repeated capturing group](https://www.regular-expressions.info/captureall.html). Also, see [How to capture multiple repeated groups?](https://stackoverflow.com/questions/37003623/how-to-capture-multiple-repeated-groups) – Wiktor Stribiżew Sep 13 '19 at 10:11
  • @WiktorStribiżew Okay I understood the repeated capturing group logic. But the linked answer isn't generalized for n capturing groups. Is that possible to achieve? – CinCout Sep 13 '19 at 10:21
  • @CinCout It is possible with some regex implementations, like .NET, PyPi regex module, or C++ Boost when compiled with a specific flag (but it is not a good idea to use it anyway). Even Onigmo had this implemented, but they decided to disable this functionality. – Wiktor Stribiżew Sep 13 '19 at 10:25
  • Yeah .Net allows that. So `\K` is the preferred way then? – CinCout Sep 13 '19 at 10:26
  • @CinCout When you want to get multiple matches between two non-identical multi-character delimiters, without the possibility of getting captured substrings out of the match, it is the only way. – Wiktor Stribiżew Sep 13 '19 at 10:29
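
One of the comments above notes that avoiding `\K` means adding more piped commands to post-process the results. A minimal sketch of what such a pipeline might look like follows. It is not taken from either answer, and the function name `extract_g_items` is hypothetical. It assumes a grep that supports `-o` (GNU and BSD grep both do) and no nested parentheses. Unlike the `\bG\(` pattern, it does not check for a word boundary before `G`.

```shell
# Hypothetical helper: extract comma-separated items from every G(...) on stdin.
extract_g_items() {
   grep -oE 'G\([^)]*\)' |                  # one G(...) group per line
      sed -e 's/^G(//' -e 's/)$//' |        # drop the "G(" prefix and ")" suffix
      tr ',' '\n' |                         # split each group on commas
      sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'   # trim each item
}

printf '%s\n' 'not me G(select me, and me)' 'G(select me) G(also me)' |
   extract_g_items
# select me
# and me
# select me
# also me
```

This trades the single regex for four simple stages, each of which can be checked on its own.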