I'm trying to strip PHP code from a file using regex. Some of the PHP is not well formatted, so there may be extra spaces and/or line breaks. As an example:

<?php require_once('some_sort_of_file.php'); 
                               ?>

I've come up with the following two-step regex approach, which seems to work:

$initial_text = preg_replace('/\s+/', ' ', $initial_text);
$initial_text = preg_replace('/' . preg_quote('<?php') . '.*?' . preg_quote('?>') . '/', '', $initial_text);

But I was wondering whether there is a way to do this with a single regex statement, in order to speed things up.

Thanks!

Eric

2 Answers

An even better way to do it: use the built-in tokenizer. Regexes have trouble parsing non-regular languages like PHP, for example when the closing tag sequence appears inside a string literal. The tokenizer, on the other hand, parses PHP code just like PHP itself does.

Sample code:

// some dummy code to play with
$myhtml = '<html>
    <body>foo bar
    <?php echo "hello world"; ?>
    baz
    </body>
    </html>';

// Our own little function to do the heavy lifting
function strip_php($text) {
    // break the code into tokens
    $tokens = token_get_all($text);
    // loop over the tokens
    foreach($tokens as $index => $token) {
        // If the token is not an array (e.g., ';') or if it is not inline HTML, nuke it.
        if(!is_array($token) || token_name($token[0]) !== 'T_INLINE_HTML') {
            unset($tokens[$index]);
        }
        else { // otherwise, echo it or do whatever you want here
            echo $token[1];
        }
    }
}

strip_php($myhtml);

Output:

<html>
<body>foo bar
baz
</body>
</html>
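
If you would rather get the stripped text back as a string than echo it, the same idea can be sketched roughly as below. The function name strip_php_to_string and the sample input are made up for illustration, and the sketch assumes the tokenizer extension is enabled (it normally is).

```php
<?php
// Hypothetical variant of strip_php() that returns the result instead of echoing it.
function strip_php_to_string(string $text): string {
    $out = '';
    foreach (token_get_all($text) as $token) {
        // Keep only inline HTML tokens, i.e. everything outside the PHP blocks.
        if (is_array($token) && $token[0] === T_INLINE_HTML) {
            $out .= $token[1];
        }
    }
    return $out;
}

// Made-up input: the string literal contains a closing-tag sequence,
// which the tokenizer treats as part of the string, not as the end of the block.
$sample = "before <?php echo 'x ?> y'; ?> after";
echo strip_php_to_string($sample), "\n";
```

Comparing the token ID against the T_INLINE_HTML constant avoids the token_name() lookup on every token; either check should select the same tokens.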

elixenide
  • Glad to help! I have had to use the tokenizer myself recently. I have some files that have code-generating code like `'; ?>`, which I needed to clean up. A regex will almost always melt down on that. :) – elixenide Mar 16 '14 at 22:54

You can do it in a single regex by using the s modifier, which lets the dot match newline characters too. I also added the i modifier to make it case-insensitive; I don't know if you care about that:

$initial_text = preg_replace('~<\?php.*?\?>~si', '', $initial_text);
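
To see what the s modifier changes, here is a small sketch that runs the pattern with and without it on a block spanning two lines. The variable names and the sample text are made up for illustration.

```php
<?php
// Made-up input: the closing tag sits on its own line, as in the question.
$text = "a <?php require_once('some_sort_of_file.php');\n      ?> b";

// Without s, the dot stops at the newline, so the block is never matched.
$without_s = preg_replace('~<\?php.*?\?>~i', '', $text);

// With s, the dot also matches newlines, so the whole block is removed.
$with_s = preg_replace('~<\?php.*?\?>~si', '', $text);

var_dump($without_s === $text); // bool(true)
var_dump($with_s);              // string(4) "a  b"
```

Note that, unlike the two-step version in the question, this does not collapse whitespace in the rest of the text; it only removes the PHP blocks.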

CrayonViolent
  • Thanks! This did it. And, though I didn't care about it being case-insensitive, it was a helpful bonus. – Eric Mar 16 '14 at 22:53