I have source code that frequently includes a block like

foo
(
    bar
    (
        foo0(<An arbitrary number of parentheses may appear here>)
    ),
    foo1bar(<An arbitrary number of parentheses may appear here>)
)

I want to capture this block; my current approach is

grep -A15 -E "foo[[:space:]]*$" <file_name>

to make sure that enough lines after foo are captured.

However, a more accurate way would be a pattern that counts the opening and closing parentheses after foo and stops searching as soon as the closing parenthesis that matches foo's opening one is found.

Is it possible to avoid scripting this algorithm by using grep options?

Example
My file is

...

foo
(
    bar
    (
        a(b)
    ),
    c(d)
)
...
dummy
(
    nextDummy()
)
...

where ... represents lines of code that do not contain any ( or ) character. The expected output of grep is

foo
(
    bar
    (
        a(b)
    ),
    c(d)
)
dummy
(
    nextDummy()
)
Wiktor Stribiżew
Naghi
  • Thanks for sharing your efforts. Could you please post more meaningful samples of input and expected output in your question to make it clearer (NOT my downvote, btw)? Cheers. – RavinderSingh13 Sep 04 '22 at 13:45
  • @RavinderSingh13 I'm not sure if you receive notification when I edit the question; btw an example is added now. – Naghi Sep 04 '22 at 15:39
  • Thanks for adding samples. Could you please let us know what those `.....` could be? Are they always spaces, or can they be digits, etc.? Knowing this would make it clearer (I am already writing code but am unsure because of this). Cheers. – RavinderSingh13 Sep 04 '22 at 15:48
  • It can be done with regular expressions only if the number of parentheses is fixed (or at least bounded). – LatinSuD Sep 04 '22 at 15:53
  • @RavinderSingh13 `...` may contain alphanumeric characters, spaces, or symbols; in other words, anything except `(` and `)`. – Naghi Sep 04 '22 at 17:21
  • @LatinSuD The number of parentheses is bounded, but we do not know in advance how many of them occur or where in the code. – Naghi Sep 04 '22 at 17:22
  • @Alish then they are not bounded – LatinSuD Sep 05 '22 at 06:55

2 Answers

Using any awk in any shell on every Unix box to print all the functions to stdout:

$ awk '/^\(/{$0=prev ORS $0; f=1} f; /^)/{f=0} {prev=$0}' file
foo
(
    bar
    (
        a(b)
    ),
    c(d)
)
dummy
(
    nextDummy()
)

or to print every function to its own file:

$ awk '/^\(/{close(out); out=prev; $0=prev ORS $0; f=1} f{print > out} /^)/{f=0} {prev=$0}' file

$ head -100 foo dummy
==> foo <==
foo
(
    bar
    (
        a(b)
    ),
    c(d)
)

==> dummy <==
dummy
(
    nextDummy()
)

or if you have a specific function you want to print:

$ awk -v tgt='foo' '/^\(/ && (prev==tgt){$0=prev ORS $0; f=1} f; /^)/{f=0} {prev=$0}' file
foo
(
    bar
    (
        a(b)
    ),
    c(d)
)

$ awk -v tgt='dummy' '/^\(/ && (prev==tgt){$0=prev ORS $0; f=1} f; /^)/{f=0} {prev=$0}' file
dummy
(
    nextDummy()
)

In the above we're assuming that a function body starts with ( on a line of its own and ends with ) on a line of its own, and that the function name is the line immediately preceding the start of the body.

Assuming whatever language your source code is written in supports strings and/or comments, it's impossible to do what you want just by counting parentheses as those could appear inside strings and comments.

You can't do this job 100% robustly without writing a parser for whatever language your source code is written in. The best we can do with pattern matching against your source code is help you write a script that'll work with the subset of the language you provide as sample input/output.
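
For the parenthesis counting the question asks about, a minimal sketch under the same caveat might look like the following. The function name `extract_block` and the variables `depth` and `seen` are hypothetical, and the sketch assumes the block name sits alone on its line with no trailing whitespace. It counts every ( and ), including any inside strings or comments, so it is only as reliable as the input is regular.

```shell
# Hypothetical helper: print the block that starts at a line equal to the
# target name and ends on the line where its opening "(" is balanced again.
# Caveat: every "(" and ")" is counted, even inside strings or comments.
extract_block() {
    awk -v tgt="$1" '
        # Start printing at a line that is exactly the target name.
        !f && $0 == tgt { f = 1; depth = 0; seen = 0 }
        f {
            print
            line = $0
            opens  = gsub(/\(/, "", line)   # gsub returns the number of replacements
            closes = gsub(/\)/, "", line)
            depth += opens - closes
            if (opens > 0) seen = 1         # at least one "(" has been seen
            if (seen && depth <= 0) f = 0   # matching ")" reached: stop printing
        }
    ' "$2"
}

# Usage: extract_block foo file
```

Unlike the scripts above, this does not require the closing ) to be at the start of a line, but it still relies on the name line preceding the body.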

Ed Morton
  • ++ve for a nice answer. Just want to ask: is there any `awk` function (custom or out of the box; I believe mostly not) or any other specific tool that deals with manipulating code as data (where code is the actual input)? Thank you, sir. – RavinderSingh13 Sep 05 '22 at 04:43
  • @RavinderSingh13 I'm not sure if I understand what you mean, but http://cscope.sourceforge.net exists for C if you're looking for a tool that can parse C code to find symbols, etc. and there are various C "beautifiers" around, e.g. https://linux.die.net/man/1/indent, that can take arbitrary C code and output it in a more regular format, and the C compiler, e.g. `gcc` has options to expand/remove symbols etc. to aid parsing, e.g. see how I use it at https://stackoverflow.com/a/35708616/1745001. Similar tools probably exist for other languages than C. – Ed Morton Sep 05 '22 at 12:02
  • Thank you, sir, for responding. Yes, I meant: do we already have such a function (maybe not officially released, but in beta or in the gawk git repo), or could we ask the `gawk` developers to build one that checks input which is itself code? In this case we need to make sure that all `(` and `)` are balanced and opened/closed properly. A function like that available by default would be a great addition to gawk, which is why I thought to check with you. – RavinderSingh13 Sep 05 '22 at 12:53
  • @RavinderSingh13 no, that doesn't exist and IMHO there's no chance of the gawk folks being willing to take it on. It's more than just `(` and `)` counts being equal, e.g. consider code like `foo ( print "1) this is an issue, 2) as is this" )` which has 2 `)`s inside a string that you would not want counted and same with comments like `// oops ) is here` and if this was `C` then you'd also have to handle [trigraphs](https://en.wikipedia.org/wiki/Digraphs_and_trigraphs) like `??(`, etc. and standards change for languages (e.g. K&R C, ANSI C, C99 are 3 C specs with different comment syntax). – Ed Morton Sep 05 '22 at 12:59
  • Yeah, if that is the case then you are right. Maybe loving `awk` means I want everything in it. Sometimes I think that if I had known C I might have tried to contribute, but that's a long way off :) – RavinderSingh13 Sep 05 '22 at 13:01
  • I wrote primarily C for 30+ years and wouldn't consider myself as really **knowing** C - it has a lot of dark corners and multiple quite different standards. I stuck to the subset I knew (so e.g. I never used a trigraph!). Awk is a tiny subset of C and I wouldn't try writing a tool that could parse awk robustly. Having said that - gawk `-o-` can produce pretty-printed output so, depending on what you'd want the output to be, you might be able to parse that and/or ask the gawk folks for some different form of output. – Ed Morton Sep 05 '22 at 13:11

If your grep supports -P (PCRE) option, would you please try:

grep -zoP "[A-Za-z_]\w*\s*(\((?:[^()]+|(?1))*\))" file

Output with the provided file:

foo
(
    bar
    (
        a(b)
    ),
    c(d)
)
dummy
(
    nextDummy()
)
  • [A-Za-z_]\w*\s* matches a name such as foo or dummy, followed by optional whitespace.
  • (\((?:[^()]+|(?1))*\)) matches a substring enclosed in parentheses whose contents are any sequence of:
    • [^()]+: any characters other than parentheses
    • (?1): a recursive match of capture group 1, i.e. the pattern enclosed by the outermost parentheses
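
To see the recursion in isolation, the same pattern can be tried on a single line, where -z is not needed (in the command above, -z makes grep treat the whole file as one record so that a match can span newlines). This is a small sketch that assumes a grep built with PCRE support, such as GNU grep; the wrapper name `balanced_calls` is hypothetical.

```shell
# Hypothetical helper: read lines on stdin and print every name(...) call
# whose parentheses are balanced. Requires a grep that supports -P (PCRE).
balanced_calls() {
    grep -oP '[A-Za-z_]\w*\s*(\((?:[^()]+|(?1))*\))'
}

# Usage:
#   printf 'x = foo(a(b), c(d)) + bar(e)\n' | balanced_calls
# prints foo(a(b), c(d)) and bar(e) on separate lines; the nested a(b) and
# c(d) are consumed by the recursion and are not reported separately.
```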
tshiono