
I have a series of strings I want to extract:

hello.this_is("bla bla bla")
some random text
hello.this_is('hello hello')
other stuff

What I need to get (from many files, but this is not important here) is the content between hello.this_is( and ), so my desired output is:

bla bla bla
hello hello

As you see, the text within parentheses can be enclosed with either double or single quotes.

If there were only single quotes, I would use a lookbehind and a lookahead like this:

grep -Po "(?<=hello.this_is\(').*(?=')" file
#                            ^      ^
# returns ---> hello hello

Similarly, to get strings from double quotes I would say:

grep -Po '(?<=hello.this_is\(").*(?=")' file
#                            ^      ^
# returns ---> bla bla bla

However, I want to match both cases, so that it handles both single and double quotes. I tried using $'' to escape, but could not make it work:

grep -Po '(?<=hello.this_is\($'["\']').*(?=$'["\']')' file
#                            ^^^^^^^^      ^^^^^^^^

I can of course use the octal ASCII codes and say:

grep -Po '(?<=hello.this_is\([\047\042]).*' file

but I would like to use the literal quote characters, since 047 and 042 are much less readable to me than ' and ".

fedorqui
  • If this is Python, you need to account for `r"..."`, `u'...'`, `r'''...'''`, `r"""...."""`, etc. – Wiktor Stribiżew Apr 03 '19 at 22:04
  • @WiktorStribiżew yes, it is. In this specific case all strings are only enclosed by a single - single or double quote (that is, no `'''`), without any formatter. In any case, good to take into account if the problem gets bigger. – fedorqui Apr 03 '19 at 22:06

3 Answers


Use a capturing group and look for its content like the following:

grep -Po 'hello\.this_is\(([\047"])((?!\1).|\\.)*\1\)' file

This also handles escaped characters, e.g. hello.this_is("bla b\"la bla")
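
As a quick check of the escaped-quote claim, here is a minimal sketch. The sample file contents are made up for illustration (the last line is the escaped example above), and it assumes a GNU grep built with PCRE support for -P:

```shell
# Hypothetical sample input; the last line contains escaped double quotes.
sample=$(mktemp)
cat > "$sample" <<'EOF'
hello.this_is("bla bla bla")
some random text
hello.this_is('hello hello')
hello.this_is("bla b\"la bla")
EOF

# Same pattern as above: group 1 captures the opening quote, and the
# repeated group consumes either a non-quote character or a backslash pair.
matches=$(grep -Po 'hello\.this_is\(([\047"])((?!\1).|\\.)*\1\)' "$sample")
printf '%s\n' "$matches"
rm -f "$sample"
```

All three hello.this_is(...) calls should be printed in full, including the one with the escaped quotes, while the "some random text" line is skipped.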

If the output should be only the text between the quotes, combine \K with a positive lookahead:

grep -Po 'hello\.this_is\(([\047"])\K((?!\1).|\\.)*(?=\1\))' file

Outputs:

bla bla bla
hello hello
revo
  • Uhms, this is clever! It would fail if we had something like `hello.this_is(XhelloX)`, but since this is Python code it won't happen :) – fedorqui Apr 03 '19 at 21:51
  • Please see the edit. I replaced `(.)` with `([\047"])`. – revo Apr 03 '19 at 21:53
  • Well the difficulty lies in using `'` and `"` instead of 047 and 042, since they collide with the two possible enclosing characters for a command. Still, I really +1 for the idea of the capturing group here. – fedorqui Apr 03 '19 at 21:56
  • I think you missed my last edit before posting your comment. Also I added a complementary approach. – revo Apr 03 '19 at 22:01

Note: The sed command at the bottom of this answer works only as long as your strings are well-behaved strings like

"foo"

or

'bar'

As soon as your strings start to misbehave :) like:

"hello \"world\""

it won't work any more.

Your input looks like source code. For a robust solution, I recommend using a parser for that language to extract the strings.


For trivial use cases:

You can use sed. This should work on any POSIX platform, in contrast to grep -oP, which requires GNU grep:

sed -n 's/hello\.this_is(\(["'\'']\)\([^"]*\)\(["'\'']\).*/\2/gp' file
#                                    ^^^^^^^^              ^^
#                                          capture group 2 ^
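
To make the note at the top of this answer concrete, here is a small sketch that runs the command on a made-up sample file (the file contents are assumptions for illustration). The last line has escaped quotes and shows where the approach breaks down:

```shell
# Hypothetical sample input; the last line is a "misbehaving" string.
sample=$(mktemp)
cat > "$sample" <<'EOF'
hello.this_is("bla bla bla")
some random text
hello.this_is('hello hello')
hello.this_is("say \"hi\"")
EOF

# Same sed command as above, applied to the sample file.
extracted=$(sed -n 's/hello\.this_is(\(["'\'']\)\([^"]*\)\(["'\'']\).*/\2/gp' "$sample")
printf '%s\n' "$extracted"
rm -f "$sample"
```

The first two strings come out intact, but the third is cut off at the first escaped quote, because [^"]* stops at any double quote, escaped or not.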
hek2mgl
  • Nice one! So the key here is doing `["'\'']`. That is, closing the single quote and then adding `\'`, to finally open the single quote again. Cool to see your sed-fu again! – fedorqui Apr 03 '19 at 22:04
  • Nice to meet you again! :) Moved the `^^^^`s to the right position, I hope that's ok (and intended). Thanks for the edit! – hek2mgl Apr 03 '19 at 22:06
  • Of course! I added it to point where the capture group is, since it is difficult to see. My mistake makes it clear it is hard – fedorqui Apr 03 '19 at 22:08
  • To be fair, this would also match `"foo'`. Things like `"foo\"bar\""` will get extremely hard to match (if not impossible?).. – hek2mgl Apr 03 '19 at 22:09
  • I am facing a similar problem based on this original question! If you want to have a look it is in [How can you match everything up to the next captured group?](https://stackoverflow.com/q/57819083/1983854) – fedorqui Sep 06 '19 at 09:15
  • @fedorqui Hey, nice to meet you :) Will look later, having a training class today – hek2mgl Sep 06 '19 at 11:50
  • Oh, cool! It finally was a duplicate of a common problem (facepalm). – fedorqui Sep 06 '19 at 12:40

Based on revo's and hek2mgl's excellent answers, I ended up using grep like this:

grep -Po '(?<=hello\.this_is\((["'\''])).*(?=\1)' file

Which can be explained as:

  • grep
  • -Po use the Perl-compatible regex engine (-P) and print only the matched text (-o)
  • '(?<=hello\.this_is\((["'\''])).*(?=\1)' the expression
    • (?<=hello\.this_is\((["'\''])) look-behind: search for strings preceded by "hello.this_is(" followed by either ' or ". Also, capture this last character to be used later on.
    • .* match everything...
    • (?=\1) until the captured character (that is, either ' or ") appears again. Since .* is greedy, this is the last occurrence of that character on the line.
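
One caveat worth checking: .* is greedy, so on a line with more than one quoted string the match runs up to the last matching quote. The sketch below compares it with a lazy .*? on a made-up input line (both the line and the lazy variant are my own illustration, not part of the original command), assuming GNU grep with -P support:

```shell
# Hypothetical line with two quoted strings.
line="hello.this_is('a') + other('b')"

# Greedy: .* extends to the last occurrence of the captured quote.
greedy=$(printf '%s\n' "$line" | grep -Po '(?<=hello\.this_is\((["'\''])).*(?=\1)')

# Lazy: .*? stops at the first occurrence of the captured quote.
lazy=$(printf '%s\n' "$line" | grep -Po '(?<=hello\.this_is\((["'\''])).*?(?=\1)')

printf 'greedy: %s\nlazy:   %s\n' "$greedy" "$lazy"
# greedy: a') + other('b
# lazy:   a
```

For the single-string lines in the question both forms give the same result, so the lazy form only matters if several quoted strings can share a line.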

The key here was to use ["'\''] to indicate either ' or ". With '\'' we close the single-quoted string, add a literal ' (which has to be escaped as \') and then open the single-quoted string again.
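
To see what the shell actually hands to grep, here is a minimal sketch that prints the bracket expression on its own. The variable names are just for illustration:

```shell
# Piece by piece: '["' yields ["   \' yields '   ']' yields ]
pattern='["'\'']'
printf '%s\n' "$pattern"    # prints ["']

# Equivalent alternative: put the single quote inside a double-quoted piece.
same='["'"'"']'
printf '%s\n' "$same"       # prints ["']
```

Both forms rely on the shell concatenating adjacent quoted pieces into a single word, so grep only ever sees ["'].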

fedorqui