I have a problem removing single-line comments from C source code using PHP and regex.

/* */ comments are removed so all that is left are // comments.

First:

I have this regex: $content = ereg_replace('^\/\/.*$', '', $content);

This removes the whole file, not just the lines of the form ^//comment$. I assume that is because the search is greedy, but how can I make it non-greedy? And how can I make it apply to every line that matches?

Second:

The problem is that comments shouldn't be removed when they appear inside a string, as in "//shall not be removed". How can I make this happen? I was thinking that when the regex finds a " character it should skip the string, but I have no idea how to do that.

Thanks to everyone who helps, I really appreciate it.

revo
Shadowmak
  • This is harder than you think to do correctly. Imagine lines like this: `char * foo = "//don't remove" " // really please don't"; // But remove this ha ha! // along with this!` and then imagine a copy of that entire line *commented out* on the next line of source. I'd question the entire task and see if there isn't a C source manipulation library you should be using instead. – BaseZen Mar 07 '15 at 18:20
  • Is it? So when I have the choice between doing this with a regex or a finite state machine, you recommend the FSM, right? :/ – Shadowmak Mar 07 '15 at 18:23
  • It gets worse. `char *bad_regex_strategy = "\"good luck\" // \"counting the \"quotes\"" "//\"//"; // \"ha""" \" " \" ha ha\" ha /// "\;"ha'` – BaseZen Mar 07 '15 at 18:31
  • @BaseZen: There is no need to "[count] the quotes". When you try to do that, the strategy is to match the quoted (or heredoc) parts first in the pattern. – Casimir et Hippolyte Mar 07 '15 at 20:42
  • OK, I meant 'matching', and the complexity argument stands. – BaseZen Mar 07 '15 at 20:54

2 Answers

1

This matches all single-line comments except those inside double-quoted strings, including the cases raised in the comments above!

(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)//.*$


* All rights reserved for this answer

Community
revo
1

To avoid the string trap, one approach is to match what you want to protect (the string literals) first, and then either capture it or skip it.
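As a minimal sketch of the "capture it" variant with preg_replace: the function name strip_line_comments is hypothetical, and the pattern only knows about double-quoted strings and // comments, so it is an illustration rather than a complete C lexer.

```php
<?php
// Hypothetical helper: strip // comments while leaving string literals alone.
function strip_line_comments(string $src): string
{
    // Alternative 1 captures a double-quoted string (backslash escapes allowed).
    // Alternative 2 matches a // comment up to the end of the line.
    $pattern = '~("(?:[^"\\\\]|\\\\.)*")|//[^\n]*~s';

    // A matched string is put back through $1; for a matched comment
    // $1 is empty, so the comment is replaced by nothing.
    return preg_replace($pattern, '$1', $src) ?? $src;
}

echo strip_line_comments('s = "// keep"; // drop'), "\n";
// prints: s = "// keep";   (the space before the comment remains)
```

Because the string alternative comes first, a // inside a string is consumed as part of the string and never reaches the comment alternative.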

The ereg_ functions have been deprecated since PHP 5.3 (and were removed in PHP 7.0), but where they are still available the pattern can be written like this:

$result = ereg_replace('("([^\\\"]|\\\.)*")|//[^' . "\n" . ']*|/\*\**([^*]|\*\**[^*/])*\*\**/', '\1', $str);

It works, but performance is very poor compared with the preg version, which offers many more features for optimizing the pattern:

$pattern2 = '~
    " [^"\\\]* (?s: \\\. [^"\\\]* )*+ " # double quoted string
    (*SKIP)(*F) # forces the pattern to fail and skips the previous substring
  |
    /
    (?:
        / .* # singleline comment 
      |
        \*   # multiline comment 
        [^*]* (?: \*+(?!/) [^*]* )*+  
        (?: \*/ )? # optional to deal with unclosed comments
    )
~xS';

$result = preg_replace($pattern2, '', $str);


The preg version is about 450x faster than the ereg_ version.
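A small way to try the preg pattern on sample input is sketched below. The wrapper name strip_c_comments and the sample source are assumptions made for illustration, and the S modifier is left out because it is only an optimization hint and does not change what matches.

```php
<?php
// Hypothetical wrapper around the preg pattern shown above.
function strip_c_comments(string $src): string
{
    $pattern = '~
        " [^"\\\\]* (?s: \\\\. [^"\\\\]* )*+ "   # double-quoted string
        (*SKIP)(*F)                              # discard it and resume after it
      |
        /
        (?:
            / .*                                 # single-line comment
          |
            \*                                   # multi-line comment
            [^*]* (?: \*+(?!/) [^*]* )*+
            (?: \*/ )?                           # tolerate an unclosed comment
        )
    ~x';

    return preg_replace($pattern, '', $src) ?? $src;
}

$src = "char *s = \"// keep\"; // drop\nint x; /* gone */ int y;\n";
echo strip_c_comments($src);
// prints:
// char *s = "// keep";
// int x;  int y;
```

Note that the pattern has no branch for single-quoted character constants, so input such as '"' would presumably be read as the start of a string; handling that would need an extra alternative.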

Details of the subpattern [^*]* (?: \*+(?!/) [^*]* )*+:

This subpattern describes the content of a multi-line comment, that is, everything between /* and the first */:

[^*]*           # all that is not an asterisk (can be empty)

(?:             # open a non-capturing group:
                # this group handles asterisks that are
                # not part of the closing sequence */

    \*+         # one or more asterisks 
    (?!/)       # negative lookahead : not followed by / 
                # (it is a zero-width assertion, in other words it's only a test 
                # and it doesn't consume characters)

    [^*]*       # zero or more characters that are not an asterisk
)*+             # repeat the group zero or more times (possessive)
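To see the subpattern in action, here is a small sketch that wraps it in /* and */ and applies it to a test string. The helper name first_block_comment is hypothetical.

```php
<?php
// Hypothetical helper: return the first /* ... */ block found in $s, or null.
function first_block_comment(string $s): ?string
{
    // Same body subpattern as above, with /* and */ added around it.
    $re = '~/\* [^*]* (?: \*+(?!/) [^*]* )*+ \*/~x';

    return preg_match($re, $s, $m) === 1 ? $m[0] : null;
}

echo first_block_comment('/*aaaa**bbb***cc***/ int x; /* b */'), "\n";
// prints /*aaaa**bbb***cc***/
```

The match stops at the first */ because the group can only consume a run of asterisks that is not immediately followed by a slash.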

Approximate walk of the regex engine through the string /*aaaa**bbb***cc***/:

/*aaaa**bbb***cc***/       /\*[^*]* (?: \*+(?!/) [^*]* )*+ \*/     succeed
/*aaaa**bbb***cc***/     /\*[^*]*(?: \*+(?!/) [^*]* )*+ \*/     succeed
/*aaaa**bbb***cc***/         /\* [^*]*(?: \*+(?!/) [^*]* )*+\*/     try group
/*aaaa**bbb***cc***/     /\* [^*]* (?:\*+(?!/) [^*]* )*+ \*/   succeed
/*aaaa**bbb***cc***/     /\* [^*]* (?: \*+(?!/)[^*]* )*+ \*/   verified
/*aaaa**bbb***cc***/     /\* [^*]* (?: \*+(?!/)[^*]*)*+ \*/     succeed
/*aaaa**bbb***cc***/         /\* [^*]*(?: \*+(?!/) [^*]* )*+\*/     try group
/*aaaa**bbb***cc***/     /\* [^*]* (?:\*+(?!/) [^*]* )*+ \*/   succeed
/*aaaa**bbb***cc***/     /\* [^*]* (?: \*+(?!/)[^*]* )*+ \*/   verified
/*aaaa**bbb***cc***/     /\* [^*]* (?: \*+(?!/)[^*]*)*+ \*/     succeed
/*aaaa**bbb***cc***/         /\* [^*]*(?: \*+(?!/) [^*]* )*+\*/     try group
/*aaaa**bbb***cc***/     /\* [^*]* (?:\*+(?!/) [^*]* )*+ \*/   succeed
/*aaaa**bbb***cc***/       /\* [^*]* (?: \*+(?!/)[^*]* )*+ \*/   fail
/*aaaa**bbb***cc***/     /\* [^*]* (?:\*+(?!/) [^*]* )*+ \*/   backtrack
/*aaaa**bbb***cc***/     /\* [^*]* (?: \*+(?!/)[^*]* )*+ \*/   verified
/*aaaa**bbb***cc***/          /\* [^*]* (?: \*+(?!/)[^*]*)*+ \*/    succeed
/*aaaa**bbb***cc***/         /\* [^*]*(?: \*+(?!/) [^*]* )*+\*/     try group
/*aaaa**bbb***cc***/     /\* [^*]* (?:\*+(?!/) [^*]* )*+ \*/   succeed
/*aaaa**bbb***cc***/       /\* [^*]* (?: \*+(?!/)[^*]* )*+ \*/   fail
/*aaaa**bbb***cc***/         /\* [^*]*(?: \*+(?!/) [^*]* )*+\*/     fail
/*aaaa**bbb***cc***/     /\* [^*]*(?: \*+(?!/) [^*]* )*+\*/     backtrack
/*aaaa**bbb***cc***/       /\* [^*]* (?: \*+(?!/) [^*]* )*+\*/      succeed

Casimir et Hippolyte
  • Thank you so much! It is working nicely :) I guess it's time to learn preg regular expressions. Just one question: can you please explain this line to me a bit: [^*]* (?: \*+(?!/) [^*]* )*+ ? It's like I don't want stars 0...n times, and then I choose between what? – Shadowmak Mar 08 '15 at 11:30
  • @user3457394: I will add the detail of this line in my answer. – Casimir et Hippolyte Mar 08 '15 at 12:30
  • Beautiful :) I really appreciate it. Thanks a lot! – Shadowmak Mar 08 '15 at 16:07