
I'm creating a PowerShell script that parses a file containing C code and detects whether it contains calls to the free(), malloc() or realloc() functions.

file_one.c

int MethodOne()  
{
    return 1;
}   
int MethodTwo()    
{   
    free();
    return 1;
} 

file_two.c

int MethodOne()  
{
    //free();
    return 1;
}
int MethodTwo()    
{       
    free();
    return 1;
} 

check.ps1

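# Intended to match free(, malloc( or realloc( when no "/" comes before the call
$regex = "(^[^/]*free\()|(^[^/]*malloc\()|(^[^/]*realloc\()"
$file_one = "Z:\PATH\file_one.txt"
$file_two = "Z:\PATH\file_two.txt"

# -Raw reads the whole file as one string instead of an array of lines
$contentOne = Get-Content $file_one -Raw
$contentOne -match $regex

$contentTwo = Get-Content $file_two -Raw
$contentTwo -match $regex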

Processing the whole file at once works well for contentOne: I get True (because of the free() in MethodTwo). contentTwo is not so lucky: it returns False instead of True, even though MethodTwo calls free() there as well.
Can someone help me write a better regex that works in both cases?

fitzbutz
  • Have you looked at this? http://stackoverflow.com/a/20961630 – Arithmomaniac Jul 28 '16 at 18:06
  • That other regex doesn't handle comments. However, I did find if I removed the `-Raw` and used `($contentTwo -match $regex).Count -ne 0`, it returns `True`. Another alternative is `$contentTwo | select-string -Pattern $regex -Quiet` – Eris Jul 28 '16 at 18:12
  • @Eris Doing this has the problem that every call to these functions made within a multiline comment /* */ will be considered a match – fitzbutz Jul 28 '16 at 18:41
  • Truthfully, regex is a poor choice for this, since C has a formal grammar. You're better off running the files through the C preprocessor to get the tokens – Eris Jul 28 '16 at 18:58
  • There is a regex that excludes comments and string literals then finds your target words. Want it? –  Jul 28 '16 at 19:11
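
The behaviour described in the question and in the comments above can be reproduced with a short sketch. It assumes the sample sources are inlined as here-strings rather than read from disk, and `$commentOnly` is a hypothetical third sample added only to show the multiline-comment false positive. Without the multiline option, `^` matches only at the very start of the `-Raw` string, and `[^/]*` cannot cross the first `/`, so a call that appears after any comment is never reached.

```powershell
# Regex from the question, unchanged.
$regex = "(^[^/]*free\()|(^[^/]*malloc\()|(^[^/]*realloc\()"

# file_two.c from the question, inlined for the sketch.
$fileTwo = @'
int MethodOne()
{
    //free();
    return 1;
}
int MethodTwo()
{
    free();
    return 1;
}
'@

# Hypothetical sample: the only call sits inside a multiline comment.
$commentOnly = @'
int MethodOne()
{
    /*
    free();
    */
    return 1;
}
'@

# Whole string: ^ is tried once, and [^/]* stops at the "/" of //free();
$wholeFile = $fileTwo -match $regex

# Line by line (like Get-Content without -Raw): ^ is tried on every line.
$perLine = @(($fileTwo -split "\r?\n") -match $regex).Count -ne 0

# Line by line also flags the call that is commented out with /* */.
$falsePositive = @(($commentOnly -split "\r?\n") -match $regex).Count -ne 0

$wholeFile      # False
$perLine        # True
$falsePositive  # True
```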

1 Answer


Sure, this is it

Raw:

^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/"'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())

Stringed:

"^(?>(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\(?:\\r?\\n)?)*?(?:\\r?\\n))|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\b(?:free|malloc|realloc)\\()[\\S\\s](?:(?!\\b(?:free|malloc|realloc)\\()[^/\"'\\\\])*))*(?:(\\bfree\\()|(\\bmalloc\\()|(\\brealloc\\())"

Verbatim:

@"^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:""[^""\\]*(?:\\[\S\s][^""\\]*)*""|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/""'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())"
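
Since the question uses PowerShell, a fourth way to hold the pattern is a single-quoted here-string, which keeps both `"` and `'` literal so the raw form needs no escaping. The following is a minimal sketch under that assumption; the variable names are illustrative and the two sample files from the question are inlined as strings instead of being read from disk.

```powershell
# Raw pattern from above, stored verbatim in a single-quoted here-string.
$pattern = @'
^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/"'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())
'@

# file_one.c from the question.
$fileOne = @'
int MethodOne()
{
    return 1;
}
int MethodTwo()
{
    free();
    return 1;
}
'@

# file_two.c from the question (first call is commented out).
$fileTwo = @'
int MethodOne()
{
    //free();
    return 1;
}
int MethodTwo()
{
    free();
    return 1;
}
'@

$matchedOne = $fileOne -match $pattern
$matchedTwo = $fileTwo -match $pattern
$firstCall  = $Matches[1]   # capture group 1 from the last successful -match

$matchedOne   # True
$matchedTwo   # True
$firstCall    # free(
```

With files on disk, the same test would read `(Get-Content $file_two -Raw) -match $pattern`, using the paths from the question.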

Explained

 ^ 
 (?>
      (?:                              # Comments 
           /\*                              # Start /* .. */ comment
           [^*]* \*+
           (?: [^/*] [^*]* \*+ )*
           /                                # End /* .. */ comment
        |  
           //                               # Start // comment
           (?:                              # Possible line-continuation
                [^\\] 
             |  \\ 
                (?: \r? \n )?
           )*?
           (?: \r? \n )                     # End // comment
      )
   |                                 # OR,

      (?:                              # Non - comments 
           "
           [^"\\]*                          # Double quoted text
           (?: \\ [\S\s] [^"\\]* )*
           "
        |  '
           [^'\\]*                          # Single quoted text
           (?: \\ [\S\s] [^'\\]* )*
           ' 
        |                                 # OR,

           (?!                              # ASSERT: Here, cannot be free / malloc / realloc {}
                \b 
                (?: free | malloc | realloc )
                \(
           )
           [\S\s]                           # Any char which could start a comment, string, etc..
                                            # (Technically, we're going past a C++ source code error)

           (?:                              # -------------------------
                (?!                              # ASSERT: Here, cannot be free / malloc / realloc {}
                     \b 
                     (?: free | malloc | realloc )
                     \(
                )

                [^/"'\\]                         # Char which doesn't start a comment, string, escape,
                                                 # or line continuation (escape + newline)
           )*                               # -------------------------
      )                                # Done Non - comments 
 )*

 (?:
      ( \b free\( )                    # (1), Free()
   |  
      ( \b malloc\( )                  # (2), Malloc()
   |  
      ( \b realloc\( )                 # (3), Realloc()
 )

Some notes:

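This only finds the first one, because the ^ anchor pins the match to the beginning of the string.
To find them all, remove the ^ from the regex (or replace it with \G, so that a later match attempt cannot start in the middle of a comment or string).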

This works because it matches everything up to what you're looking for.
In this case, what it found is in capture group 1, 2, or 3.
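
The two notes above can be sketched in PowerShell with `[regex]::Matches`. This sketch swaps the leading `^` for the .NET `\G` anchor, which is a substitution and not part of the original answer: `\G` forces each match to begin where the previous one ended. The sample source and variable names are hypothetical. The last line counts the matches obtained when the `^` is simply removed, which on this sample includes one extra hit that starts partway through the final `//` comment.

```powershell
$pattern = @'
^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/"'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())
'@

# Assumption of this sketch: replace ^ with \G instead of just deleting it.
$anchored = '\G' + $pattern.Substring(1)

# Hypothetical sample with calls in code, comments and a string literal.
$source = @'
char *p = malloc(10);      /* malloc( in a block comment */
p = realloc(p, 20);        // free( in a line comment
puts("free(p) in a string");
free(p);
// free(p); again, but commented out
return 0;
'@

# Report which capture group (1, 2 or 3) succeeded for each match.
$names = foreach ($m in [regex]::Matches($source, $anchored)) {
    if     ($m.Groups[1].Success) { 'free' }
    elseif ($m.Groups[2].Success) { 'malloc' }
    else                          { 'realloc' }
}
$names -join ', '     # malloc, realloc, free

# Same pattern with the ^ merely removed, for comparison.
$looseCount = [regex]::Matches($source, $pattern.Substring(1)).Count
$looseCount           # 4
```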

Good Luck !!


What the regex contains:

----------------------------------
 * Format Metrics
----------------------------------
Atomic Groups       =   1

Cluster Groups      =   10

Capture Groups      =   3

Assertions          =   2
       ( ? !        =   2

Free Comments       =   25
Character Classes   =   12

edit
Per request, here is an explanation of the part of the regex that handles
/**/ comments. This -> /\*[^*]*\*+(?:[^/*][^*]*\*+)*/

This is a modified unrolled-loop regex that assumes an opening delimiter
of /* and a closing one of */.
Notice that the open and close delimiters share a common character, /.
To handle this without lookaround assertions, the trailing delimiter's
asterisk is shifted inside the loop.
With this factoring, all that's needed is to check for a closing /
to complete the delimited sequence.

 /\*              # Opening delimiter /*

 [^*]*            # Optionally, consume all non-asterisks

 \*+              # Must match 1 or more asterisks here, or the match FAILS.
                  # They are matched here to line up with the optional loop below,
                  # which is looking for the closing /.

 (?:              # The optional loop part
      [^/*]            # Specifically a single non / character (nor asterisk).
                       # Since a / will be the next closing delimiter, it must be excluded.

      [^*]*            # Optional non-asterisks.
                       # This will accept a / because it is supposed to consume ALL
                       # opening delimiters as it goes,
                       # and will consider the very next */ as a close.

      \*+              # Must match 1 or more asterisks here, or the match FAILS.
 )*               # Repeat 0 to many times.

 /                # Closing delimiter /
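
The walkthrough above can be checked on the comment sub-pattern alone. This is a small sketch with a made-up input string: it contains a comment with a second `/*` inside it, an empty `/**/`, and an unterminated comment that should not match.

```powershell
# Block-comment part of the regex, on its own.
$blockComment = '/\*[^*]*\*+(?:[^/*][^*]*\*+)*/'

# Hypothetical input: nested opener, empty comment, and a comment never closed.
$text = 'x = 1; /* first /* nested? **/ y = 2; /**/ z = 3; /* never closed'

$found = @([regex]::Matches($text, $blockComment) | ForEach-Object { $_.Value })
$found
# /* first /* nested? **/
# /**/
```

The inner `/*` is swallowed by `[^*]*`, the run `**` is taken by `\*+`, and only then is the closing `/` checked.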
  • There are some double quotes in the raw regex. If your shell doesn't use escapes for strings, you may have to escape them with double/triple escapes. –  Jul 28 '16 at 19:58
  • Also, this regex uses newline `\n` only detection. If you open the file translated, you may have to change them to `(?:\r\n)` verbatim, or `(?:\r?\n)` covers both. –  Jul 28 '16 at 20:08
  • I don't understand very well the part of comments (both // and /* */). Could you kindly explain these parts with more details? – fitzbutz Jul 29 '16 at 10:01
  • @GiorgioGambino - First of all, did the regex work? The `/**/` form uses `/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`. It's a specially designed, modified unrolled loop that is fast and matches precisely from the opening `/*` to the closing `*/`, consuming any new `/*` in the middle. The `//` form uses `//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n)`, which lazily consumes any line continuation (`escape + newline`) until it finds an unescaped newline that terminates the `//` line comment. –  Jul 29 '16 at 15:36
  • Yes it works well! What is not clear to me is the central part `[^*]*\*+(?:[^/*][^*]*\*+)` the other part (the one for // comments ) is now clear. – fitzbutz Aug 01 '16 at 07:48
  • @GiorgioGambino - The `/**/` parsing part is just an unrolled loop, as I said. What makes it strange is that the open and close delimiters share a forward slash, which would normally require an assertion. This is how to do it without assertions. I've edited my post with a detailed description. If this were a test of speed, this would be the fastest possible way for this to be done. –  Aug 02 '16 at 16:10
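
The `//` form discussed in the comments above can be tried in isolation as well. A minimal sketch with a hypothetical three-line input: the first line ends in a backslash, so the comment continues onto the second line, and only the third line is real code.

```powershell
# Line-comment part of the regex, on its own.
$lineComment = '//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n)'

# Hypothetical input: a // comment continued with a trailing backslash.
$text = @'
// note \
free(p); still part of the comment
free(q);
'@

$m = [regex]::Match($text, $lineComment)

# Everything after the matched comment is what the main regex would scan next.
$rest = $text.Substring($m.Index + $m.Length)
$rest   # free(q);
```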