Regex : Remove all comments from html file BUT preserve same number of lines

Question

If a comment in a file covers 6 of its lines, the comment should be removed and replaced with as many empty lines as the comment occupied.

Here is a small demonstration of what I mean. Given that file.html has 10 lines:

    line 1 : <!-- text
    line 2 :      text
    line 3 :      text
    line 4 :      empty line
    line 5 :      text
    line 6 : -->
    line 7 :empty line
    line 8 :text
    line 9 :empty line
    line 10 :text

The expected output would be :

    line 1 :empty line
    line 2 :empty line
    line 3 :empty line
    line 4 :empty line
    line 5 :empty line
    line 6 :empty line
    line 7 :empty line
    line 8 :text
    line 9 :empty line
    line 10 :text

The pattern I am currently using, `preg_replace('/(?=/', '', $contents);`, replaces each comment with an empty string, which does not preserve the number of lines the file previously had.

Note that any solution needs to keep the structure of the file as it was such that the text on line 8 and 10 don't change position within the file.

Edit: I have no idea why this was flagged as a duplicate. It is in no way similar to the supposed duplicate, which asks in general how to parse the DOM, whereas my question is specifically about removing commented text from a file without altering the number of lines in that file.

Maybe use `preg_replace_callback`, and inside of there count the returns in the callback and use that? — Chris Haas, May 13 '21 at 19:51
Something like this: `preg_replace_callback('/(?=/', function ($m) { return preg_replace('/[^\n]/','', $m[0]); }, $contents);` — micke, May 13 '21 at 20:48
There is no way this is a duplicate of the linked question. This is not about parsing XML or HTML in PHP. Please read the edited part of the question again, where the OP has clearly explained the difference — anubhava, May 14 '21 at 03:55
This question has indeed been wrongly marked as a duplicate. The OP has explained well why this question is not related to the dupe target. — Arvind Kumar Avinash, May 14 '21 at 05:41
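
The callback idea suggested in the comments above can be sketched as follows. This is only an illustration: the comment-matching pattern `/<!--.*?-->/s` and the function name `stripCommentsKeepNewlines` are assumptions, not taken from the original post. Each matched comment is replaced by just the newlines it contained, so the line count is unchanged and text before or after an inline comment survives.

```php
<?php
// Hypothetical helper: remove HTML comments but keep every newline they contained.
function stripCommentsKeepNewlines(string $contents): string
{
    return preg_replace_callback(
        '/<!--.*?-->/s', // assumed pattern: lazy match, "s" lets "." cross line breaks
        function (array $m): string {
            // Drop every character of the comment except "\n".
            return preg_replace('/[^\n]/', '', $m[0]);
        },
        $contents
    );
}

$contents = "a <!-- one\ntwo\nthree --> b\nc\n";
$result = stripCommentsKeepNewlines($contents);
// $result is "a \n\n b\nc\n": the comment spanned 3 lines, so its 2 newlines remain.
```

Note that this sketch leaves `\r` characters behind only if you adapt the inner pattern; as written it assumes `\n` line endings.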

score 6 · Accepted Answer · answered May 13 '21 at 20:29

You may use this regex for searching:

    (?:^\h*<!--|(?<!\A|-->\n)\G).*\R

and replace each match with a "\n" (with the `m` flag enabled, so that `^` matches at the start of every line).
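
As a rough illustration, here is how that search and replace might be applied in PHP to the 10-line example from the question. The `m` modifier, the variable names, and the assumption of `\n` line endings are mine rather than part of the answer; the comment is also assumed to start at the beginning of a line and to end with `-->` as the last thing on its line.

```php
<?php
// Pattern from the answer, wrapped in delimiters with the "m" modifier (assumed).
$pattern = '/(?:^\h*<!--|(?<!\A|-->\n)\G).*\R/m';

// The sample file from the question: a 6-line comment followed by 4 more lines.
$lines = ['<!-- text', '     text', '     text', '', '     text', '-->', '', 'text', '', 'text'];
$contents = implode("\n", $lines) . "\n";

// Every line belonging to the comment is replaced by a bare line break.
$result = preg_replace($pattern, "\n", $contents);

$out = explode("\n", $result);
// $out[7] and $out[9] (lines 8 and 10) still hold "text"; lines 1-7 are empty.
```

Note the double quotes around `"\n"` in the replacement: in a single-quoted PHP string, `'\n'` is a literal backslash followed by `n`, not a line break.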

RegEx Details:

`(?:`: Start non-capture group
- `^`: Start of a line
- `\h*`: Match 0 or more horizontal whitespace characters
- `<!--`: Match the comment opener `<!--`
- `|`: OR
- `(?<!\A|-->\n)`: Negative lookbehind to avoid a match if we are at the start position or if `-->` + line break precedes the current position
- `\G`: Match end position of previous match
`)`: End non-capture group
`.*\R`: Match remaining characters in line followed by line break

To match inline comments as well, removing the start-of-line symbol `^` seems to do the trick: `$formattedContents = preg_replace('/(?:\h*<!--|(?<!\A|-->\n)\G).*\R/m', "\n", $contents);` — coderdonezo, Jun 16 '21 at 16:09
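
A small sketch of that inline variant, under the assumption that the closing `-->` is the last thing on its line and that the file uses `\n` line endings (the sample input and variable names are hypothetical). Because `.*\R` consumes the rest of each matched line, anything that follows `-->` on the same line would be removed as well.

```php
<?php
// Variant without the ^ anchor, so a comment may start in the middle of a line.
$pattern = '/(?:\h*<!--|(?<!\A|-->\n)\G).*\R/';

// A comment that opens after some text and closes two lines later.
$contents = "text <!-- c\nmore\n-->\nend\n";

$result = preg_replace($pattern, "\n", $contents);
// $result is "text\n\n\nend\n": "text" and "end" stay on lines 1 and 4.
```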
