What you are trying to do is dangerous.
Regular expressions aren't powerful enough to parse a language as complex as C++. You can find a great discussion here (it is about HTML, but the same reasoning applies). Proper C++ parsing requires a full-fledged parser. According to comments I read while researching this topic myself, C++ is so difficult that most commercial parsers can't do it right, as there are simply too many edge cases. As suggested in the answer I have linked, a regexp-based approach is possible under some circumstances. But you have to make sure your data follows known patterns, and it is normally difficult to make such an assumption.
That said... Your code doesn't even compile. You have to fix your regexp like this:
while ($data =~ /(.*::.*)/g ) {
But this means you will only find functions that are members of a class. You will also get false positives, because the class::function syntax is used to call functions as well as to define them, so I'd look for the semicolon at the end of their declaration in the .h file. Namespaces use the same :: notation too. When I was trying to write my own regexp to parse C++ (before discovering that it can't be done, as explained above) I was trying to find something like this:
#!/usr/bin/perl
use strict;
use warnings;
my $data = "int& myClass::Function1();\n"
         . "void * me::function2(const int& temp, double a, char b[]);\n"
         . "double** other::function_3 (int array[], int& result);\n";
# Captures: $1 = return type, $6 = class, $7 = function name, $8 = argument list
while ($data =~ /\s*(\w+([\s&\*]*))((::)?((\w+)::)?(\w+)\s*\(([^)]*)\)\s*;)/gs ) {
    my $return_type = $1;
    my $class = $6;
    my $function_name = $7;
    my $arguments = $8;
    print "return_type = $return_type\n";
    print "class = $class\n";
    print "function_name = $function_name\n";
    print "arguments = $arguments\n";
}
As you can see, this regexp is already pretty complex, and still there are a lot of cases that it can't catch (what about namespaces, templates, multi-line functions with possibly an argument + comment per line? And so on...). If you really want to go this way, try a test-based approach:
- Analyse the format of your data, that is, the function names that you want to consider (for example: do they use namespaces? Do they return references, pointers and so on? In that case, do they have spaces in between?)
- Create a test suite, that is, a list of functions called function1, function2, function3... making sure that you have one case for each possible syntax (this is the hard part, because how can you be sure that you have considered them all?)
- Write a regexp that covers as many cases as possible. If you can't cover all of them with one, consider using more than one (in the example I gave it would be more than one while loop). Every time you have a match, print it. At the end, check that you have found all the functions in your test.
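The steps above could be sketched roughly like this. It is only a minimal illustration: the sub name extract_declarations, the test declarations and the expected values are all made up for the example, and the pattern is the same one used in the script earlier. Each test case pairs a declaration with the return type, class and function name you expect to get back, so the run tells you which syntaxes the regexp does not handle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as in the script above, precompiled so it can be reused.
my $decl = qr/\s*(\w+([\s&\*]*))((::)?((\w+)::)?(\w+)\s*\(([^)]*)\)\s*;)/s;

# Hypothetical helper: returns one [return_type, class, function_name]
# triple per declaration found in the given text.
sub extract_declarations {
    my ($text) = @_;
    my @found;
    while ($text =~ /$decl/g) {
        # Copy the captures before running any other regexp.
        my ($ret, $class, $name) = ($1, $6, $7);
        $ret =~ s/\s+$//;                    # drop trailing whitespace
        $class = '' unless defined $class;   # free function: no class
        push @found, [$ret, $class, $name];
    }
    return @found;
}

# Hypothetical test suite: [input, expected return type, class, function name].
my @tests = (
    ["int& myClass::Function1();",            "int&",        "myClass", "Function1"],
    ["void * me::function2(const int& t);",   "void *",      "me",      "function2"],
    ["double** other::function_3 (int a[]);", "double**",    "other",   "function_3"],
    ["int function4(int a);",                 "int",         "",        "function4"],
    # Namespace-qualified return type plus a nested scope.
    ["std::string ns::cls::function5();",     "std::string", "cls",     "function5"],
);

my @failed;
for my $test (@tests) {
    my ($input, @expected) = @$test;
    my @found = extract_declarations($input);
    # A case passes only if exactly one declaration was found and all fields agree.
    my $ok = @found == 1 && join('|', @{ $found[0] }) eq join('|', @expected);
    push @failed, $input unless $ok;
    print(($ok ? "ok:   " : "FAIL: "), $input, "\n");
}
print scalar(@failed), " of ", scalar(@tests), " cases failed\n";
```

With these five cases the first four pass and the last one fails: the pattern does find function5, but it reports "ns" as the return type instead of "std::string". That is exactly the kind of gap the test suite is meant to expose before you trust the regexp on real headers.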
If you can do all of this, and if you have done a very good job at defining the test cases, you can succeed. But let me repeat that regular expressions aren't the right tool for this, and they work only in a limited set of cases, and even determining whether they do in yours is hard.
Again: consider a parser!