I have a C file containing a series of C-style comments (/* */), each followed by a variable definition. The variable name also appears in its comment. Some comments also contain variable names they are not for (see the 3rd comment in the example below).

Here's an example of the format:

/* Object: function1: Does some really cool things and then it ends */
const function1 = someValue;

/* Object: function2: Does more really cool things and then it ends */
const function2 = someValue2;

/* Object: function3: Does even more really cool things
just like function2, does but continues over to the next line for a multiline comment */
const function3 = someValue3;

/* Object: function4: Does all kinds of cool things
and needs function1 in order to set a value correctly */
const function4 = someValue4;

/* Object: function5: Does some other cool things
and needs function2[with another variable] to do some things */
const function5 = someBValue5;

I only want to match the variable names with a result like this: function1 function2 function3 function4 function5

I've been playing around with this on https://regexr.com/ for hours and I cannot get this one.

This is what I have tried: the post "regex to find a string, excluding comments". That post uses a negative lookbehind. I cannot use a negative lookbehind because this regex is being used in Perl 5.32.1 on a Windows 10 machine.

This is the best I could come up with:

(\bfunction[\w]+\b[^:,])

Here it excludes line matches with : or , but it doesn't exclude duplicates that are enclosed inside the /* */. I haven't been able to figure out a way other than a negative lookbehind, which I cannot use.

Ultimately, I think the best solution would be to exclude everything between /* and */ and only search for things that are not contained within the comments. But it would need to support exclusion of multiline comment content and not use a negative lookbehind.

[Image: Testing Round 1]

This is not a complete answer to my problem because it doesn't omit the const and the space in front of the function name. The names function1, function2, etc. are just generic function names. They would be alphanumeric, so I believe function[\w]+ still provides the best capture for the function names.

brian d foy
  • Is `(function[\w]+)` what you're looking for? @LearnToBeBetter – lemon Apr 13 '22 at 00:15
  • No, this will grab all occurrences of both in the comments and outside the comments resulting in lots of duplicates. – LearnToBeBetter Apr 13 '22 at 02:48
  • If you're just trying to grab the "function# =" ones, how about this: `if (/(function.+?) =/) { print "$1"; }` Prints: **function1 function2 function3 function4 function5** – Keith Gossage Apr 13 '22 at 04:13
  • Now I get your task, check out my answer @LearnToBeBetter – lemon Apr 13 '22 at 11:01
  • I don't understand exactly which function names you need. Is it: (1) the ones from the definition? (2) or the ones from the comments, but only those that have the following definition? Why is `/(function[0-9]+):/` (--> `$1` is then `function1` etc) not good? Or, to make more sure what you're getting, even `/Object:\s+(function[0-9]+):/`. – zdim Apr 13 '22 at 16:07
  • @zdim the ones from the definition. Omitting the ones in the comments – LearnToBeBetter Apr 13 '22 at 16:42
  • You can't have a regex that finds `function` and gives it once without checking the context in which that phrase is found. In order to check the context, you can't avoid backreferencing. In backreferencing what you're doing is matching in first place context+needed_words (which you enclose in parentheses), then you're able to extract the submatch you need by referencing the content of the enclosed parentheses. The operation of backreferencing is a post-processing that you do after getting context+needed_words match, and that you can't simulate inside regexr.com. – lemon Apr 13 '22 at 16:43
  • Or I just grab all the data from const to the end of the function name and load that into a Perl array. Use a loop to iterate through the array and trim off everything before the f in the function name, right? – LearnToBeBetter Apr 13 '22 at 16:49
  • @LearnToBeBetter "_the ones from the definition. Omitting the ones in the comments_" -- but only from those definitions that are preceded by comments that mention the same function ? (And is there always a `:` following the function name in those comments?) – zdim Apr 13 '22 at 17:15
  • @zdim, correct. My thought, at least, was to make sure I omit all comments, multiline included. This way the matching only focuses on the definitions. Even further, it should only focus on the definition name, not any of the declarations for the definition name. For example, in "const boolean function_some_name = someValue1" I'm looking to match only function_some_name and omit all the other surrounding text. – LearnToBeBetter Apr 13 '22 at 17:21
  • @LearnToBeBetter OK, posted a (working for me) program, let me know. Another option is to parse the file by (whole) comments, which will be less picky. Will add that when I get to it. Please keep in mind, in all this: finding (correctly!) comments in a C file _in general_ can be really really tricky. – zdim Apr 13 '22 at 18:03
  • @LearnToBeBetter Updated my answer with another approach – zdim Apr 14 '22 at 07:28

2 Answers

My take on the problem: Find a function name from its definition (followed by =) outside of comments but following a comment where it was mentioned (followed by :).

Here is a simple, step-by-step stateful approach: Detect whether we are inside a comment and whether we find /(function[0-9]+):/, and set suitable flags; then look for the same function after the comment and update the flags.

use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 filename\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my (@func_names, $inside_comment, $func_name);
while (<$fh>) { 
    chomp;
    # Are we inside a comment? Look for function[0-9]+: 
    if (m{/\*}) {                           #/ fix syntax hilite
        $inside_comment = 1 if not m{\*/};  #/ starts multiline comment?
        if (/(function[0-9]+):/) { 
            $func_name = $1; 
        }
    }   
    elsif (m{\*/}) {         #/ closing line for multiline comment
        $inside_comment = 0;  
        if (not $func_name and /(function[0-9]+):/) {   #/
            $func_name = $1; 
        }
    }   
    elsif ($inside_comment and not $func_name) { 
        if (/(function[0-9]+):/) { 
            $func_name = $1; 
        }
    }   
    # Check for name outside (after) a comment where it was found
    elsif (not $inside_comment and $func_name) { 
        if (/(function[0-9]+)\s+=/) { 
            say "Found our definition: $1";
            push @func_names, $1;
            $func_name = ''; 
        }
    }   
}
say for @func_names;

This prints the expected names for the supplied sample. A downside: each line is tested twice with a regex. For small files, like source code, one will never notice, but it just isn't nice. There may be (edge?) cases which aren't covered; please test and improve.


Another option: read the whole file into a string and step through it by comments, checking for a function name after each; or parse it for comments and function definitions. Both use \G with /g.

use warnings;
use strict;
use feature 'say';

die "Usage $0 filename\n" if not @ARGV;
my $cont = do { local $/; <> };

# Pattern inside a C-style comment, possibly multiline (NOTE: not general)
my $re_cc = qr{/\* .*? (function[0-9]+): .*? \*/}sx;

my @func_names;

while ($cont =~ /$re_cc\s*/gc) { 
    my $func_name = $1;
    if ( $cont =~ /\G .*? (function[0-9]+)\s*=/x and $func_name eq $1 ) {
        push @func_names, $1;
    }
}

say for @func_names;

See perlop about the anchor \G and its use in combination with the modifier /g. Some other resources are this page and this post (and there are more).

This makes some assumptions, and perhaps a slightly safer version is:

use warnings;
use strict;
use feature 'say';

die "Usage $0 filename\n" if not @ARGV;
my $cont = do { local $/; <> };

# Pattern inside a C-style comment, possibly multiline (NOTE: not general)
my $re_cc = qr{/\* .*? (function[0-9]+): .*? \*/}sx;

my @func_names;
my $func_name = '';    # nothing seen in a comment yet
while (1) {
    if ($cont =~ /\G $re_cc \s*/gcx) { 
        $func_name = $1;
    }
    elsif ($cont =~ /\G (function[0-9]+)\s* = .*?\n\s*/gcx 
            and $func_name eq $1) {
        #say "Found function definition for: $1 (at pos=", pos $cont, ")";
        push @func_names, $1;
        $func_name = '';
    }
    elsif ($cont =~ /\G \S+ \s*/gcx) { }       # other, skip
    else                             { last }

}

say for @func_names;

These both process the supplied file correctly, but can surely be improved for more general cases.

Please keep in mind that correctly identifying C-style comments in the general case may be very tricky. See this perldoc FAQ.
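For simple files like the sample, one way to sidestep the problem is to delete the comments first and scan what is left. The following is only a minimal sketch: it assumes comments are plain /* ... */ blocks that never occur inside string literals, and the helper name names_outside_comments and the inline sample are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the identifiers that appear directly
# before "=" once all /* ... */ comments have been removed.
sub names_outside_comments {
    my ($src) = @_;

    # /s lets . cross newlines; .*? stops at the first closing */.
    # Assumption: no "/*" inside string literals, no // comments.
    (my $code = $src) =~ s{/\*.*?\*/}{}gs;

    # With /g in list context the match returns every captured name.
    return $code =~ /\b(\w+)\s*=/g;
}

my $sample = <<'END';
/* Object: function1: Does some really cool things and then it ends */
const function1 = someValue;

/* Object: function3: Does even more really cool things
just like function2, does but continues over to the next line for a multiline comment */
const function3 = someValue3;
END

print join(' ', names_outside_comments($sample)), "\n";   # function1 function3
```

This needs no lookbehind. The word before = is taken as the name, so it also handles definitions such as "const boolean function_some_name = someValue1", but it would be fooled by other uses of = in real C code (==, assignments inside functions).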


One: If a comment isn't followed by a definition our flags may stay in a faulty state


While the note about improving concerns the programs' general operation, here is a related comment about the regex.

The $re_cc pattern uses the /s modifier so that . matches a newline as well, as it must in order for .* to match across multiple lines. Since the pattern is built with qr//, the modifier travels with the compiled pattern and stays confined to it when $re_cc is interpolated into a larger regex. Written instead as a trailing /s on the larger regex, it would apply to that whole regex, which may not be intended.

For that case there is a way to set an embedded (pattern-match) modifier, which applies only within the pattern

/(?s)pattern(?-s)/

or, if the pattern naturally works inside its group the modifier is dropped outside of it so we don't need to cancel it with (?-s)

/((?s)pattern)/
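A small sketch of how far each form reaches, using a made-up three-line string. In every regex the first . has to match a newline; the second . is there only to show whether the modifier is still active after the pattern of interest.

```perl
use strict;
use warnings;

my $text = "a\nb\nc";

# Modifier switched on, then off again: the second . cannot match "\n".
my $on_off  = $text =~ /(?s)a.b(?-s).c/ ? 'match' : 'no match';

# Modifier set inside a group: it ends with the group.
my $grouped = $text =~ /((?s)a.b).c/    ? 'match' : 'no match';

# Trailing /s: applies to the whole regex, so both dots match "\n".
my $global  = $text =~ /a.b.c/s         ? 'match' : 'no match';

print "on/off: $on_off\n";      # on/off: no match
print "grouped: $grouped\n";    # grouped: no match
print "global: $global\n";      # global: match
```

The non-capturing form (?s:pattern) behaves like the grouped variant without creating a capture group.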
zdim

Try using const as a filter:

"const (\bfunction[\w]+\b[^:,])"

Requiring const in front rules out the occurrences inside the comments, so each function name is matched only once.

To get only the function name, reference the first capture group (\1, or $1 in Perl).
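In Perl this idea might look like the sketch below. It is a variation rather than the exact pattern above: the capture group stops at the end of the name, optional type words between const and the name are skipped, and the inline sample is only for illustration. It assumes the word const followed by a definition never occurs inside a comment.

```perl
use strict;
use warnings;

my $src = <<'END';
/* Object: function4: Does all kinds of cool things
and needs function1 in order to set a value correctly */
const function4 = someValue4;

/* Object: function5: Does some other cool things
and needs function2[with another variable] to do some things */
const function5 = someBValue5;
END

# "const" (plus any type words) is matched as context but not captured;
# with /g in list context only the captured names are returned.
my @names = $src =~ /\bconst\s+(?:\w+\s+)*?(function\w+)\s*=/g;

print "@names\n";   # function4 function5
```

Because the context is consumed outside the parentheses, no lookbehind and no trimming step are needed.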

lemon