26

I want to parse a file using PHP and regex to strip:

  • blank or empty lines
  • single line comments
  • multi line comments

Basically, I want to remove any line containing

/* text */

or multi-line comments such as

/***
some
text
*****/

If possible, I would also like another regex that checks whether a line is empty (to remove blank lines).

Is that possible? Can somebody post a regex that does just that?

Thanks a lot.

Alan Moore
Ahmad Fouad

9 Answers

52
$text = preg_replace('!/\*.*?\*/!s', '', $text);
$text = preg_replace('/\n\s*\n/', "\n", $text);
chaos
  • Thanks a lot! The first regex removed single line comments. However the second regex did no change and didn't remove multi line comments. I appreciate your response..thanks again – Ahmad Fouad Mar 13 '09 at 15:12
  • Make sure you have the !s on the first regex; it wasn't in my initial answer. That's what makes it handle multiline comments. The second pattern removes empty lines. – chaos Mar 13 '09 at 15:17
  • The !s makes it work 100%. It works much better than my regex, +1 from me. – St. John Johnson Mar 13 '09 at 15:29
  • Thanks! This worked for me. But my code also had common comments // Like so. I managed to also clear these with this regex ```$strData = preg_replace('/(?:(?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:(?<!\:|\\\|\'|\")\/\/.*))/', '', $strData);```, that I got from this source: https://stackoverflow.com/a/31907095/2510785 – Jorge Mauricio Apr 17 '22 at 21:31
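
To make the effect of the two calls concrete, here is a small sketch that runs them on the kind of input described in the question. The wrapper name and the sample input are made up for illustration; only the two patterns come from the answer.

```php
<?php
// Hypothetical wrapper around the two preg_replace() calls from the answer.
function strip_block_comments_and_blank_lines(string $text): string
{
    // The "s" modifier lets "." match newlines, so "/* ... */" may span lines;
    // the lazy ".*?" stops at the first closing "*/".
    $text = preg_replace('!/\*.*?\*/!s', '', $text);

    // Collapse "line break, optional whitespace, line break" into a single "\n".
    return preg_replace('/\n\s*\n/', "\n", $text);
}

// Made-up sample: one multi-line comment and one single-line /* */ comment.
$input = "first();\n/***\nsome\ntext\n*****/\n\nsecond(); /* text */\n";
echo strip_block_comments_and_blank_lines($input);
```

Note that the second pattern needs a line break before the blank line, so blank lines at the very start of the input are left alone, and the space that preceded a trailing comment stays in place.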
12

Keep in mind that any regex you use will fail if the file you're parsing has a string containing something that matches these conditions. For example, it would turn this:

print "/* a comment */";

Into this:

print "";

Which is probably not what you want. But maybe it is, I don't know. Anyway, regexes technically can't parse data in a way that avoids this problem. I say "technically" because modern PCRE regexes have tacked on a number of hacks that make them capable of doing this and, more importantly, no longer regular expressions in the formal sense. If you want to avoid stripping these things inside quotes or in other situations, there is no substitute for a full-blown parser (although it can still be pretty simple).
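
As a rough illustration of the PCRE "hack" route (not a replacement for a parser), one common trick is to match string literals and comments in a single alternation and delete only the comments. The function name below is hypothetical, and the sketch assumes the input contains only simple quoted strings: no heredocs, and no `//` or `#` comments containing stray quotes.

```php
<?php
// Hypothetical helper: removes /* ... */ comments but leaves quoted strings alone.
function strip_block_comments_outside_strings(string $source): string
{
    $pattern = '~
        "(?:\\\\.|[^"\\\\])*"      # double-quoted string, with escapes
      | \'(?:\\\\.|[^\'\\\\])*\'   # single-quoted string, with escapes
      | /\*.*?\*/                  # block comment
    ~xs';

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            // A match starting with "/*" is a comment: drop it. Otherwise it is
            // a string literal: put it back unchanged.
            return strncmp($m[0], '/*', 2) === 0 ? '' : $m[0];
        },
        $source
    );
}

$input = 'print "/* a comment */"; /* real comment */';
echo strip_block_comments_outside_strings($input), "\n";
```

Because the string alternatives are tried first at each position, a `/*` inside quotes is consumed as part of the string and never reaches the comment branch.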

Chris Lutz
7
//  Removes multi-line comments without leaving a blank line behind;
//  also consumes surrounding spaces/tabs. The "m" modifier is needed
//  so that ^ matches at the start of every line, not only at the start of the text.
$text = preg_replace('!^[ \t]*/\*.*?\*/[ \t]*[\r\n]!sm', '', $text);

//  Removes single-line '//' comments, including the spaces/tabs before them
$text = preg_replace('![ \t]*//.*[ \t]*[\r\n]!', '', $text);

//  Strip blank lines
$text = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $text);
makaveli_lcf
  • The single line comment replace doesn't work when there are URLs involved. `https://example.com` is also replaced. – ascx May 15 '17 at 14:49
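
One possible way around the URL problem is a negative lookbehind that refuses to treat `//` as a comment start when it directly follows a colon. This is only a heuristic sketch (the function name is made up), and it still does not know about strings in general.

```php
<?php
// Hypothetical helper: strips "//" comments but skips "//" right after ":".
function strip_line_comments(string $text): string
{
    // [ \t]*     spaces/tabs in front of the comment
    // (?<!:)     the "//" must not be preceded by ":" (as in "https://")
    // [^\r\n]*   everything up to, but not including, the line break
    return preg_replace('~[ \t]*(?<!:)//[^\r\n]*~', '', $text);
}

echo strip_line_comments('$url = "https://example.com"; // home page'), "\n";
```

Unlike the pattern in the answer, this variant keeps the line break, so a line that held only a comment becomes an empty line and has to be removed in a separate step.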
3
$string = preg_replace('#/\*[^*]*\*+([^/][^*]*\*+)*/#', '', $string);
2

It is possible, but I wouldn't do it. You need to parse the whole PHP file to make sure that you're not removing any necessary whitespace (inside strings, between keywords and identifiers, where you would end up with publicfunctiondoStuff(), etc.). It is better to use PHP's tokenizer extension.

soulmerge
  • I want to rely on regex only. The file is very simple: it has a couple of single-line comments, a multi-line comment, and some PHP code (each on a new line). I just want a regex that cleans it up, so I can use the output in the browser for a different purpose. – Ahmad Fouad Mar 13 '09 at 15:18
  • Be aware that the regex-only approach will miss "here documents". To properly identify such text you really do need to use a tokenizer. – Peter Jan 28 '13 at 18:09
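
To illustrate the heredoc point, here is a minimal sketch using `token_get_all()` from the tokenizer extension. The helper name and the sample source are made up; the idea is simply to rebuild the source from its tokens and leave out the comment tokens.

```php
<?php
// Hypothetical helper: rebuilds the source from its tokens, skipping comments.
function strip_comments_with_tokenizer(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (is_string($token)) {
            // Single-character tokens such as ";" are returned as plain strings.
            $out .= $token;
            continue;
        }
        [$id, $text] = $token;
        if ($id !== T_COMMENT && $id !== T_DOC_COMMENT) {
            $out .= $text;
        }
    }
    return $out;
}

// Sample source held in a nowdoc so nothing inside it is interpreted.
$source = <<<'CODE'
<?php
$text = <<<EOT
/* part of the heredoc, not a comment */
EOT;
/* a real comment */
echo $text;
CODE;

echo strip_comments_with_tokenizer($source), "\n";
```

The tokenizer reports the heredoc body as string content, so the `/* ... */` inside it survives, while the real comment after it is dropped. A regex such as `!/\*.*?\*/!s` would remove both.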
2

This should replace everything from /* to */.

$string = preg_replace('/(\s+)\/\*([^\/]*)\*\/(\s+)/s', "\n", $string);
St. John Johnson
1

This is a good function, and WORKS!

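<?php
// T_ML_COMMENT only exists in PHP 4 and T_DOC_COMMENT only in PHP 5 and later;
// define whichever one is missing so the switch below works on both.
if (!defined('T_ML_COMMENT')) {
   define('T_ML_COMMENT', T_COMMENT);
} else {
   define('T_DOC_COMMENT', T_ML_COMMENT);
}
function strip_comments($source) {
    $tokens = token_get_all($source);
    $ret = "";
    foreach ($tokens as $token) {
       if (is_string($token)) {
          // Single-character tokens (";", "{", ...) are plain strings
          $ret.= $token;
       } else {
          // Everything else is an array: token id, text, line number
          list($id, $text) = $token;

          switch ($id) { 
             case T_COMMENT: 
             case T_ML_COMMENT: // we've defined this
             case T_DOC_COMMENT: // and this
                // Comments are dropped
                break;

             default:
                $ret.= $text;
                break;
          }
       }
    }    
    // The opening and closing PHP tags are kept, so the result is still valid PHP
    return trim($ret);
}
?>

Now use the function strip_comments on code held in a variable:

<?php
// A nowdoc keeps the sample code literal, including its double quotes
$code = <<<'CODE'
<?php 
    /* this is comment */
   // this is also a comment
   # me too, am also comment
   echo "And I am some code...";
?>
CODE;

$code = strip_comments($code);

echo htmlspecialchars($code);
?>

In the browser this displays as follows (ignoring the whitespace-only lines left behind where the comments were):

<?php
   echo "And I am some code...";
?>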

Loading from a PHP file:

<?php
$code = file_get_contents("some_code_file.php");
$code = strip_comments($code);

echo htmlspecialchars($code);
?>

Loading a PHP file, stripping comments, and saving it back:

<?php
$file = "some_code_file.php";
$code = file_get_contents($file);
$code = strip_comments($code);

$f = fopen($file,"w");
fwrite($f,$code);
fclose($f);
?>

Source: http://www.php.net/manual/en/tokenizer.examples.php

Eduardo Cuomo
  • This works great. But there is one problem: it does not remove the empty lines left where the comments were. If a file contains 500 lines of comments, the text is removed but the empty lines are still there. Can you tell us the proper way of removing these empty lines? – asim-ishaq May 02 '13 at 07:06
  • Apply this to the result to remove empty lines: preg_replace('/\n\s*\n/', "\n", $code), or this to remove only the empty lines at the start: preg_replace('/^\n\s*\n/', '', $code) – Eduardo Cuomo Jun 07 '13 at 15:02
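
As a sketch of that clean-up step, the helper below (a made-up name, not part of the answer) deletes every line that is empty or holds only spaces and tabs. It assumes the text has no strings or heredocs whose blank lines must be preserved.

```php
<?php
// Hypothetical helper: drops lines that are empty or contain only spaces/tabs.
function remove_blank_lines(string $text): string
{
    // With the "m" modifier, ^ matches at the start of every line;
    // \R matches any line break (\n, \r\n or \r).
    return preg_replace('/^[ \t]*\R/m', '', $text);
}

// Made-up input resembling comment-stripped code with leftover blank lines.
$stripped = "<?php\n    \n   \n   echo \"And I am some code...\";\n\n?>";
echo remove_blank_lines($stripped), "\n";
```

Anchoring on the line start rather than on a preceding line break means blank lines at the very top of the text are removed as well.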
0

This is my solution, for anyone who is not used to regexp. The following code removes all comments introduced by # and retrieves the values of variables written in the style NAME=VALUE.

  $reg = array();
  $handle = @fopen("/etc/chilli/config", "r");
  if ($handle) {
      while (($buffer = fgets($handle, 4096)) !== false) {
          // Drop everything from the first "#" to the end of the line
          $start = strpos($buffer, "#");
          $res = ($start !== false) ? substr($buffer, 0, $start) : $buffer;

          // Split NAME=VALUE on the first "=" only, so values may contain "="
          $a = explode("=", $res, 2);
          $name = trim($a[0]);

          // Skip blank lines and lines that held only a comment
          if ($name === "") {
              continue;
          }

          // A name without "=" gets an empty value
          $reg[$name] = isset($a[1]) ? trim($a[1]) : "";
      }

      if (!feof($handle)) {
          echo "Error: unexpected fgets() fail\n";
      }
      fclose($handle);
  }
gdm
0

I found that this one suits me better: (\s+)\/\*([^\/]*)\*/\n*. It removes multi-line comments, indented or not, together with the whitespace around them. Here is an example of a comment that this regex matches:

/**
 * The AdditionalCategory
 * Meta informations extracted from the WSDL
 * - minOccurs : 0
 * - nillable : true
 * @var TestStructAdditionalCategorizationExternalIntegrationCUDListDataContract
 */
Rogerio Dalot