How to remove HTML comments using Regex in Python

Question

I want to remove HTML comments from a string of HTML. For example, this input:

<h1>heading</h1> <!-- comment-with-hyphen --> some text <-- con --> more text <hello></hello> more text

should result in:

<h1>heading</h1> some text <-- con --> more text <hello></hello> more text

Using regular expressions on a limited, known set of HTML may be appropriate. However, you should be aware that there are countless cases where it will break and it is generally not advised. — grc, Jan 29 '15 at 06:38
Why the downvotes on the question? If you are working on a "known set of HTML" this was a legit question. — Rushabh Mehta, Jan 30 '15 at 07:24
Consider using an HTML-specific library like Beautiful Soup, as the answers to this other question suggest: https://stackoverflow.com/questions/23299557/beautifulsoup-4-remove-comment-tag-and-its-content — hectorcanto, Apr 22 '20 at 00:39
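
For readers who want a parser-based route without a third-party dependency, here is a minimal sketch that uses the standard library's html.parser in place of Beautiful Soup. The names CommentStripper and strip_comments are hypothetical. The sketch re-emits the markup it sees and skips comment nodes; end tags are rebuilt from the lowercased tag name, and it has not been checked against every kind of malformed input.

```python
from html.parser import HTMLParser


class CommentStripper(HTMLParser):
    """Hypothetical helper: re-emit the parsed markup, skipping comments."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it appeared.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser only reports the lowercased name, so the tag is rebuilt.
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.parts.append(f"<?{data}>")

    def handle_comment(self, data):
        # Comments are dropped on purpose.
        pass


def strip_comments(html):
    parser = CommentStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = (
    "<h1>heading</h1> <!-- comment-with-hyphen --> some text "
    "<-- con --> more text <hello></hello> more text"
)
print(strip_comments(sample))
```

Unlike a regex, this also leaves alone any text that merely looks like a comment inside a script or style element, because the parser treats that content as plain data.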

score 9 · Answer 1 · edited Jan 14 '19 at 12:04

Don't overlook line breaks: a comment can span several lines, and by default . does not match a newline. Pass re.DOTALL so that it does:

re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)
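
As a quick check, here is the same substitution applied to the sample string from the question. The pattern is the answer's, minus the capturing group, which re.sub does not need here; the name COMMENT_RE is only illustrative.

```python
import re

# Non-greedy match from "<!--" to the nearest "-->", newlines included.
COMMENT_RE = re.compile(r"<!--.*?-->", flags=re.DOTALL)

sample = (
    "<h1>heading</h1> <!-- comment-with-hyphen --> some text "
    "<-- con --> more text <hello></hello> more text"
)

result = COMMENT_RE.sub("", sample)
print(repr(result))
# '<h1>heading</h1>  some text <-- con --> more text <hello></hello> more text'
```

The spaces on either side of the removed comment both survive, so the output has two spaces where the question's expected output shows one. If that matters, whitespace has to be handled as a separate step.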

edited Jan 14 '19 at 12:04

Wiktor Stribiżew

answered Jan 29 '15 at 06:41

John Hua

Why shouldn't we remove the carriage returns as well? – Ethan Jan 03 '16 at 06:36
huazhihao's answer matches comments that have carriage returns within the comment. One of the other answers lacks flags=re.MULTILINE – Greg Lindahl Nov 02 '16 at 02:04
Actually, it should be `re.DOTALL`, not `re.MULTILINE`. `re.DOTALL` is the flag that makes `.` match `\n`. – fjsj Feb 14 '17 at 19:36
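
The difference between the two flags can be seen on a small input. This is a sketch with a made-up string, not taken from the thread:

```python
import re

text = "before <!-- line one\nline two --> after"

# re.MULTILINE only changes how ^ and $ behave; '.' still stops at '\n',
# so the two-line comment is never matched.
with_multiline = re.sub(r"<!--.*?-->", "", text, flags=re.MULTILINE)

# re.DOTALL lets '.' match '\n', so the whole comment is matched.
with_dotall = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)

print(with_multiline == text)  # True: nothing was removed
print(repr(with_dotall))       # 'before  after'
```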

Shawn · Answer 2 · 2017-08-10T19:34:22.067

4

html = re.sub(r"<!--(.|\s|\n)*?-->", "", html)

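re.sub finds every match of the pattern and replaces it with the second argument. In this case, <!--(.|\s|\n)*?--> matches anything that starts with <!-- and ends with -->. The . matches any character except a newline, *? repeats the group as few times as possible, and \s and \n cover the line breaks in multi-line comments.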
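
To illustrate, the alternation in this pattern should give the same result as a plain . combined with re.DOTALL. The input string below is made up for the comparison:

```python
import re

html = "<p>keep</p>\n<!-- first line\nsecond line -->\n<p>also keep</p>"

# This answer's pattern: \s already covers the newline that '.' skips,
# so the extra \n alternative is redundant but harmless.
alternation = re.sub(r"<!--(.|\s|\n)*?-->", "", html)

# The same idea expressed with '.' and re.DOTALL.
dotall = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)

print(alternation == dotall)  # True
print(repr(dotall))           # '<p>keep</p>\n\n<p>also keep</p>'
```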

edited Aug 10 '17 at 19:34

answered Aug 10 '17 at 16:44

Shawn

Welcome to Stack Overflow! If the OP could understand your code by itself, he probably would not be asking. Please explain what it does, so that it provides value for those who would need to look up a regex. – jpaugh Aug 10 '17 at 17:44

score 3 · Answer 3 · answered Jan 29 '15 at 06:22

Finally came up with this option:

re.sub("(<!--.*?-->)", "", t)

Adding the ? makes the match non-greedy, so separate comments are not merged into a single match.
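
A small comparison shows what the ? changes. The input string is a made-up example with two comments on one line:

```python
import re

t = "a <!-- one --> b <!-- two --> c"

# Greedy: .* runs to the LAST "-->", swallowing the text between comments.
greedy = re.sub(r"<!--.*-->", "", t)

# Non-greedy: .*? stops at the FIRST "-->", so each comment is matched alone.
non_greedy = re.sub(r"<!--.*?-->", "", t)

print(repr(greedy))      # 'a  c'
print(repr(non_greedy))  # 'a  b  c'
```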

answered Jan 29 '15 at 06:22

Rushabh Mehta

score 2 · Answer 4 · answered Jan 29 '15 at 09:14

Don't use regex. Use an XML parser instead; the one in the standard library is more than sufficient.

from xml.etree import ElementTree as ET

# The default parser discards comments; the input must be well-formed XML
tree = ET.parse("comments.html")
ET.dump(tree)  # Dumps to stdout
tree.write("no-comments.html", method="html")  # Write to a file
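
A self-contained sketch of the same idea on in-memory strings follows. It relies on ElementTree's default parser dropping comments, and it also shows the main limitation: the input has to be well-formed XML, which the question's sample is not. The strings and variable names are illustrative.

```python
from xml.etree import ElementTree as ET

# A well-formed fragment: a single root element and no stray '<'.
well_formed = "<div><h1>heading</h1> <!-- a comment --> some text</div>"
root = ET.fromstring(well_formed)
cleaned = ET.tostring(root, encoding="unicode", method="html")
print(cleaned)

# The question's sample has text outside a root element and a bare '<',
# so the XML parser rejects it.
sample = (
    "<h1>heading</h1> <!-- comment-with-hyphen --> some text "
    "<-- con --> more text <hello></hello> more text"
)
try:
    ET.fromstring(sample)
    parsed = True
except ET.ParseError:
    parsed = False
print(parsed)  # False
```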

answered Jan 29 '15 at 09:14

Iskren

While this is good advice, XML parsers are much, much slower than this sort of regex. – Greg Lindahl Nov 02 '16 at 02:07

score 1 · Answer 5 · answered Aug 11 '18 at 11:05

re.sub("(?s)<!--.+?-->", "", s)

or

re.sub("<!--.+?-->", "", s, flags=re.DOTALL)
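
One thing worth checking with .+? is the empty comment <!---->. Because + needs at least one character, the match cannot end at the comment's own --> and runs on to the next one. The sketch below uses a made-up string to compare it with .*?:

```python
import re

s = "a<!---->b<!-- x -->c"

# .+? must consume at least one character, so the empty comment's own "-->"
# cannot close the match; it extends to the next "-->" and removes "b" too.
with_plus = re.sub(r"<!--.+?-->", "", s, flags=re.DOTALL)

# .*? can match zero characters, so the empty comment is matched on its own.
with_star = re.sub(r"<!--.*?-->", "", s, flags=re.DOTALL)

print(with_plus)  # ac
print(with_star)  # abc
```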

answered Aug 11 '18 at 11:05

Dmitry Mottl

score 0 · Answer 6 · answered Jan 29 '15 at 06:36

You could try this regex <![^<]*>
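
A short sketch of how this pattern behaves may help when weighing it. The first input is the question's sample; the nested and doctype strings are made-up examples.

```python
import re

pattern = re.compile(r"<![^<]*>")

# The question's sample: the real comment goes, "<-- con -->" stays.
sample = (
    "<h1>heading</h1> <!-- comment-with-hyphen --> some text "
    "<-- con --> more text <hello></hello> more text"
)
print(repr(pattern.sub("", sample)))
# '<h1>heading</h1>  some text <-- con --> more text <hello></hello> more text'

# A comment containing a tag: [^<]* cannot cross the inner '<', so no match.
nested = "a <!-- <b>bold</b> --> z"
print(pattern.sub("", nested) == nested)  # True: the comment is left in place

# Anything starting with "<!" qualifies, including a doctype declaration.
doctype = "<!DOCTYPE html><p>hi</p>"
print(pattern.sub("", doctype))  # <p>hi</p>
```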

answered Jan 29 '15 at 06:36

dragon2fly

Your regex matches too much -- note that the question has an example "<-- con -->", which is not an HTML comment. – Greg Lindahl Nov 02 '16 at 02:05
@GregLindahl this regex didn't match "<-- con -->" and returned the result as the OP expected. – dragon2fly Nov 03 '16 at 01:45
This won't match a comment with an HTML tag inside of it, like – k-den Sep 02 '20 at 21:57
