
I have an HTML file with some sections that need to be removed; all sections except one should go. I have put together a small example below. Oddly, a regex editor does recognize the section.

I want to remove everything between <!-- and -->, but it doesn't work.

import re

test = '<br/><br/>    </span>    <!--TABLE<table class=MsoTableGrid border=1 cellspacing=0 cellpadding=0 style=\'border-collapse:collapse;border:none\'>        <tr style=\'height:12.95pt\'>            <td width=225 valign=top style=\'width:109.45pt;border:solid windowtext 1.0pt;padding:2.4pt 5.4pt 2.4pt 5.4pt;height:12.95pt\'>                <span style=\'font-family:"Arial",sans-serif\'>                    <b>Kontosaldo in \x80</b>                </span>            </td>        </tr>        <tr style=\'height:12.95pt\'>            <td width=146 valign=top style=\'width:109.45pt;border:solid windowtext 1.0pt;padding:2.4pt 5.4pt 2.4pt 5.4pt;height:12.95pt\'>                <span style=\'font-family:"Arial",sans-serif\'>                    [substringR]                </span>            </td>        </tr>    </table>TABLE-->'
# My attempt at removing everything between <!-- and --> (this is the version that does not work)
r = re.compile(r"(?<=<!--)([\s\n.<>\]\[\\=;,€\/\-\'\":\w\n]+)(?=-->)")
mystring = r.sub('', test)
Erik Steiner
  • Not directly related to the question, but I'd use BeautifulSoap instead of complicating things with regexes. Something like this: https://stackoverflow.com/questions/33138937/how-to-find-all-comments-with-beautiful-soup – Aaron_ab Jan 14 '19 at 12:02
  • BeautifulSoap is a new one to me @Aaron_ab ! – Jerry Jan 14 '19 at 12:30

1 Answer


"Everything in between <!-- and -->" is this expression:

<!--.*?-->

replaced with the empty string. Compile it with the re.DOTALL flag so that . also matches newlines.
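As a minimal sketch of the above, assuming Python's standard `re` module: the names `COMMENT_RE`, `remove_comments` and `sample` are made up for illustration, and the sample string is a shortened stand-in for the one in the question.

```python
import re

# Non-greedy match from "<!--" to the nearest "-->".
# re.DOTALL lets "." match newlines, so multi-line comments are covered.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def remove_comments(html):
    """Return html with every <!-- ... --> block replaced by the empty string."""
    return COMMENT_RE.sub("", html)

# Shortened, hypothetical sample with a comment that spans several lines.
sample = "<br/><span>keep</span> <!--TABLE<table>\n<tr><td>x</td></tr>\n</table>TABLE--> <p>also keep</p>"
cleaned = remove_comments(sample)
print(cleaned)
```

The `?` after `.*` matters: without it, a file with two comments would lose everything from the first `<!--` to the last `-->`, including the markup in between.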


Note: Modifying HTML with regex is a recipe for disaster. Don't do it. This particular task, namely "removing comments", is a grey area: regex cannot deal with languages that can be arbitrarily nested (such as HTML), but HTML comments cannot be nested, so there is a good chance that this works. However, don't try the same approach with "replacing all tables"; it won't work.

But still, HTML can be functional and still horribly broken in soooo many ways, that even for this task there will be HTML files that disintegrate completely when you try this seemingly safe regex on them.

The proper approach is just as @Aaron_ab suggests: parse the HTML file into a DOM tree, find the nodes you want to remove, and write the DOM tree back to a file, as shown in this answer: How to find all comments with Beautiful Soup.
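For a parser-based variant that needs no third-party package, here is a rough sketch using the standard library's `html.parser` instead of Beautiful Soup. Note that this is a swapped-in technique: it is event-based and re-emits the markup piece by piece rather than building a DOM tree, and the names `CommentStripper` and `strip_comments` are hypothetical. End tags are re-emitted in lowercase, so the output is not guaranteed to be byte-identical to the input.

```python
from html.parser import HTMLParser

class CommentStripper(HTMLParser):
    """Re-emit the parsed markup piece by piece, skipping comments."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as it appeared in the input.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")

    def handle_comment(self, data):
        # Dropping the comment is the whole point: emit nothing.
        pass

def strip_comments(html):
    parser = CommentStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)

print(strip_comments("<p>a &amp; b</p><!-- x\n<table></table> --><br/>c"))
# prints: <p>a &amp; b</p><br/>c
```

For anything beyond dropping comments, a tree-building parser such as Beautiful Soup, as in the linked answer, is likely the more comfortable choice.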

Tomalak
  • It's frightening how simple it is. No wonder it didn't work for me. – Erik Steiner Jan 14 '19 at 16:11
  • But heed my warning - this simplicity is treacherous. Don't try to do more complex tasks than this with regex. Even "modify this attribute value" is a task that calls for an HTML parser, and I'm not even talking about structural modifications like "add a table row". – Tomalak Jan 14 '19 at 16:32