
I'm looking for a regex that matches all calls to a variadic C function, including calls that span multiple lines. The end goal is to print the file, line number, and fourth parameter of each call, but I'm not there yet. So far, I have this:

 perl -ne 'print if s/^.*?(func1\s*\(([^\)\(,]+||,|\((?2)\))*\)).*?$/$1/s' test.c

with test.c:

int main() {
        func1( a, b, c, d);
        func1( a, b,
               c, d);
        func1( func2(), b, c, d, e );
        func1( func2(a), b, c, d, e );
        return 1;
}

-- which does not match the second call. The reason it doesn't match is that the s at the end of the expression allows . to match newlines, but doesn't seem to allow [..] constructs to match newlines. I'm not sure how to get past this.

I'm also not sure how to reference the fourth parameter in this... the $2, $3 do not get populated in this (and even if they did I imagine I would get some issues due to the recursive nature of the regex).

HardcoreHenry
  • I guess you just need `perl -0777 -ne...`, see [Perl command line multi-line replace](https://stackoverflow.com/q/9670426/3832970). And as for accessing all captures, you can check [Difference in Perl regex variable $+{name} and $-{name}](https://stackoverflow.com/q/70750715/3832970). – Wiktor Stribiżew Aug 16 '22 at 15:36
  • If I do this, it only matches the first call (the `.*?$` at the end slurps up the rest of the file). – HardcoreHenry Aug 16 '22 at 15:38
  • If you need multiline mode, add `m` flag then. – Wiktor Stribiżew Aug 16 '22 at 15:44
  • Ok, the following does it: `perl -0777 -ne 'my @matches = $_ =~ /(func1\s*\(([^\)\(,]+|,||\((?2)\))*\))/g; print "$_\n" for @matches' test.c` – HardcoreHenry Aug 16 '22 at 15:51
  • Yes, here, you got rid of the anchors, so no need of `m` flag any more. – Wiktor Stribiżew Aug 16 '22 at 15:55
  • Note C in general is quite difficult to parse. You could have `/*comments),(*/` and `"quotes\"),("` in the middle of the function, commas nested within arguments like `f(a, g(b,c), h(d,(x+y)*z), fourth)`, ugly macros which expand to a `,` or unbalanced `(` or `)`, etc. – aschepler Aug 16 '22 at 16:46

2 Answers


This should catch your function calls, with caveats:

perl -0777 -wnE'@f = /(func1\s*\( [^;]* \))\s*;/xg; s/\s+/ /g, say for @f' tt.c

I use the fact that a statement must be terminated by ;. This assumes that there is no stray ; inside a comment, and it misses calls to func1 that are nested inside another call. If either of those is possible then quite a bit more needs to be done to parse it.
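The same ; trick can also yield the line number that the question asks for. The following is only a sketch under the same caveats: find_calls is a hypothetical helper name, and the file name in the output is hard-coded for illustration. It uses the offset of the first capture group, $-[1], and counts the newlines that precede it.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: returns [line_number, call_text] for every
# "func1( ... );" statement found in the source text $src.
sub find_calls {
    my ($src) = @_;
    my @calls;
    while ($src =~ /(func1\s*\( [^;]* \))\s*;/xg) {
        my $call   = $1;
        # $-[1] is the offset where capture group 1 starts;
        # the number of newlines before it gives the line number
        my $before = substr($src, 0, $-[1]);
        my $line   = 1 + ($before =~ tr/\n//);
        $call =~ s/\s+/ /g;    # collapse multiline calls to one line
        push @calls, [$line, $call];
    }
    return @calls;
}

my $src = <<'END_C';
int main() {
        func1( a, b, c, d);
        func1( a, b,
               c, d);
        return 1;
}
END_C

say "test.c:$_->[0]: $_->[1]" for find_calls($src);
```

The line number is captured before the s/\s+/ /g substitution, because any later successful match overwrites @-.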

However, further parsing the captured calls, presumably by commas, is complicated by the fact that a nested call may well, and realistically, contain commas. How about

func1( a, b, f2(a2, b2), c, f3(a3, b3), d );
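One way to handle such a call without any modules is to split only on commas that sit at parenthesis depth zero. This is a minimal sketch, not a C parser: split_top_level is a hypothetical helper name, and it assumes the argument list contains no string literals, comments, or macros that hide parentheses or commas.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: split an argument list on commas that are not
# inside any parentheses. Assumes balanced parens and no strings/comments.
sub split_top_level {
    my ($args) = @_;
    my ($depth, $current, @out) = (0, '');
    for my $ch (split //, $args) {
        if ($ch eq ',' && $depth == 0) {
            push @out, $current;    # a top-level comma ends an argument
            $current = '';
            next;
        }
        $depth++ if $ch eq '(';
        $depth-- if $ch eq ')';
        $current .= $ch;
    }
    push @out, $current;            # the last argument has no trailing comma
    s/^\s+|\s+$//g for @out;        # trim surrounding whitespace
    return @out;
}

my @args = split_top_level('a, b, f2(a2, b2), c, f3(a3, b3), d');
say "fourth argument: $args[3]";    # prints "fourth argument: c"
```

Commas inside f2(...) and f3(...) are seen at depth 1, so they stay part of their argument.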

This becomes a far more interesting little parsing problem. Or, how about macros?

Can you clarify what kinds of things one doesn't have to account for?


As the caveats mentioned above can be ignored in this case, here is a way to parse the argument list using Text::Balanced.

Since we need to extract whole function calls if they appear as an argument, like f(a, b), the most suitable function from the library is extract_tagged. With it we can make the opening tag be a word-left-parenthesis (\w+\() and the closing one a right-parenthesis \).

This function extracts only the first occurrence, so it is wrapped in extract_multiple:

use warnings;
use strict;
use feature 'say';

use Text::Balanced qw(extract_multiple extract_tagged);
use Path::Tiny;  # path(). for slurp

my $file = shift // die "Usage: $0 file-to-parse\n";

my @functions = path($file)->slurp =~ /( func1\( [^;]* \) );/xg; 
s/\s+/ /g for @functions; 

for my $func (@functions) { 
    my ($args) = $func =~ /func1\s*\(\s* (.*) \s*\)/x;
    say $args;

    my @parts = extract_multiple( $args, [ sub { 
        extract_tagged($args, '\\w+\\(', '\\\)', '.*?(?=\w+\()')
    } ] );

    my @arguments = grep { /\S/ } map { /\(/ ? $_ : split /\s*,\s*/ } @parts;
    s/^\s*|\s*\z//g for @arguments;
    say "\t$_" for @arguments;
}

extract_multiple returns two kinds of parts. The first kind is a (nested) function call on its own, identifiable by its parentheses; that is an argument as it stands, and is what we were after with all this. The second kind is a string of comma-separated plain arguments, which is then split into individual arguments.

Note the amount of escaping in extract_tagged (found by trial and error)! This is needed because those strings are twice double-quoted in a string-eval. That isn't documented at all, so see the module's source.

Or directly produce the escape-hungry characters (\x5C for \), which then need no escaping:

extract_tagged($_[0], "\x5C".'w+'."\x5C(", '\x5C)', '.*?(?=\w+\()')

I don't know which I'd call "clearer".

I tested on the file provided in the question, to which I added the call

func1( a, b, f2(a2, f3(a3, b3), b2), c, f4(a4, b4), d, e );

For each call the program prints the string with the argument list to parse, followed by the parsed arguments. The most interesting part of the output is for the added call above:

[ ... ]
a, b, f2(a2, f3(a3, b3), b2), c, f4(a4, b4), d, e 
        a
        b
        f2(a2, f3(a3, b3), b2)
        c
        f4(a4, b4)
        d
        e
zdim
  • For my immediate needs, I know that none of the parameters are strings, but they may contain formulas (with braces). So I can assume the `;` trick will work (I think I can reasonably assume there's no `;`'s in comments...) As far as `,`,s go, I can use the recursive formula and copy the regex for each parameter, and then only capture the fourth one... – HardcoreHenry Aug 16 '22 at 18:07
  • @HardcoreHenry OK, those restrictions are good news. Commas can be done reasonably easily, but it's more work than what I'd consider a one-liner-"grade". And I'd rather use something like [Text::Balanced](https://perldoc.perl.org/Text::Balanced), or the similar regex from a package. (This being a nested call, so that `;` can't be used ... that would mean yet more work.) Would you like me to add that? – zdim Aug 16 '22 at 18:11
  • For matching nested items this nice [post](https://stackoverflow.com/a/15302308/4653379) came up quick, and I can readily find my snippets [here](https://stackoverflow.com/a/57323443/4653379) and [here](https://stackoverflow.com/a/64256610/4653379) for example. There's a lot more out there – zdim Aug 16 '22 at 18:23
  • @HardcoreHenry By "regex from a package" I meant `Regexp::Common`, see in links in my previous comment. – zdim Aug 16 '22 at 18:51
  • @HardcoreHenry Added a way to parse the string with the list of arguments. Will probably edit when I get to read it over (it's tested and it works), let me know if there are questions/comments/etc – zdim Aug 17 '22 at 08:14

Not Perl but perhaps simpler:

$ cat >test2.c <<'EOD'
int main() {
    func1( a, b, c, d1);
    func1( a, b,
           c, d2);
    func1( func2(), "quotes\"),(", /*comments),(*/ g(b,
c), "d3", e );
    func1( func2(a), b, c, d4(p,q,r), e );
    func1( a, b, c, func2( func1(a,b,c,d5,e,f) ), g, h);
    return 1;
}
EOD

$ cpp -D'func1(a,b,c,d,...)=SHOW(__FILE__,__LINE__,d,)' test2.c |
  grep SHOW
    SHOW("test2.c",2,d1);
    SHOW("test2.c",3,d2)
    SHOW("test2.c",5,"d3")
    SHOW("test2.c",7,d4(p,q,r));
    SHOW("test2.c",8,func2( SHOW("test2.c",8,d5) ));
$

As the final line shows, a bit more work is needed if the function can take itself as an argument.
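For comparison, staying in Perl, nested calls to the same function can be located with a recursive pattern placed inside a lookahead, so that matches may overlap. This is a sketch under the same assumptions as before (balanced parentheses, no strings or comments containing parentheses); find_nested_calls is a hypothetical helper name.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: return every func1(...) call in $src, including
# calls nested inside the arguments of another func1 call.
sub find_nested_calls {
    my ($src) = @_;
    my @calls;
    # Group 2 matches one balanced (...) group and recurses into itself
    # via (?2). The lookahead consumes nothing, so the search resumes
    # right after the start of each call and also finds inner calls.
    while ($src =~ /(?=(\bfunc1\s*(\((?:[^()]++|(?2))*\))))/g) {
        push @calls, $1;
    }
    return @calls;
}

say for find_nested_calls('func1( a, b, c, func2( func1(a,b,c,d5,e,f) ), g, h);');
```

Each returned string is a complete call with balanced parentheses, so its argument list can then be split in a separate step.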

jhnc