8

I want to remove HTML comments from a piece of HTML text. For example:

<h1>heading</h1> <!-- comment-with-hyphen --> some text <-- con --> more text <hello></hello> more text

should result in:

<h1>heading</h1> some text <-- con --> more text <hello></hello> more text
CDspace
Rushabh Mehta
  • Using regular expressions on a limited, known set of HTML may be appropriate. However, you should be aware that there are countless cases where it will break and it is generally not advised. – grc Jan 29 '15 at 06:38
  • Related: http://stackoverflow.com/a/1732454/3001761 – jonrsharpe Jan 29 '15 at 07:57
  • Why the downvotes on the question? If you are working on a "known set of HTML" this was a legit question. – Rushabh Mehta Jan 30 '15 at 07:24
  • Consider using an HTML-specific library like Beautiful Soup, as the answers to this other question suggest: https://stackoverflow.com/questions/23299557/beautifulsoup-4-remove-comment-tag-and-its-content – hectorcanto Apr 22 '20 at 00:39

6 Answers

9

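You shouldn't ignore newlines. By default, . does not match a newline, so a comment that spans several lines is left in place. Pass re.DOTALL so that . matches newlines as well: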

re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)
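To see why the flag matters, here is a minimal sketch that runs the same pattern with and without re.DOTALL. The helper name strip_comments and the sample string are made up for illustration.

```python
import re

COMMENT = "<!--.*?-->"


def strip_comments(text, flags=0):
    """Remove every match of COMMENT from text (hypothetical helper)."""
    return re.sub(COMMENT, "", text, flags=flags)


# Made-up input: one comment that spans two lines.
multi_line = "<p>a</p><!-- first line\nsecond line --><p>b</p>"

# Without DOTALL, "." stops at the newline, so nothing matches.
without_dotall = strip_comments(multi_line)

# With DOTALL, "." also matches the newline and the comment is removed.
with_dotall = strip_comments(multi_line, flags=re.DOTALL)
```

Without the flag the string comes back unchanged; with it, only the two paragraphs remain.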
Wiktor Stribiżew
John Hua
4
html = re.sub(r"<!--(.|\s|\n)*?-->", "", html)

re.sub finds every match of the pattern and replaces it with the second argument. In this case, <!--(.|\s|\n)*?--> matches anything that starts with <!-- and ends with -->. The dot matches any character except a newline, and *? repeats it as few times as possible, so the match stops at the first -->. The \s and \n alternatives cover the case of multi-line comments.

Shawn
  • Welcome to Stack Overflow! If the OP could understand your code by itself, he probably would not be asking. Please explain what it does, so that it provides value for those who would need to look up a regex. – jpaugh Aug 10 '17 at 17:44
3

Finally came up with this option:

re.sub("(<!--.*?-->)", "", t)

Adding the ? makes the search non-greedy and does not combine multiple comment tags.
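A small sketch of the difference, assuming a made-up string with two comments on one line:

```python
import re

# Made-up input: two separate comments with text between them.
text = "a <!-- one --> keep <!-- two --> b"

# Greedy: ".*" runs to the LAST "-->", swallowing the text between the comments.
greedy = re.sub("<!--.*-->", "", text)

# Non-greedy: ".*?" stops at the FIRST "-->", so each comment is removed on its own.
lazy = re.sub("<!--.*?-->", "", text)
```

The greedy version loses the word "keep", while the non-greedy version preserves it. Note that the whitespace on either side of a removed comment stays behind.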

Rushabh Mehta
2

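Don't use regex. Use an XML parser instead; the one in the standard library is more than sufficient, as long as the document is well-formed XML. Its default parser skips comments while building the tree, so they are gone when you write the tree back out.

from xml.etree import ElementTree as ET

tree = ET.parse("comments.html")  # Comments are dropped during parsing
ET.dump(tree)  # Dumps to stdout
tree.write("no-comments.html", method="html")  # Write to a file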
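For a quick check without files, the same idea can be sketched on strings. The sample markup below is hypothetical, and the second part shows the main caveat: markup that is not well-formed XML raises an error instead of being repaired.

```python
from xml.etree import ElementTree as ET

# Hypothetical, well-formed input; real-world HTML often is not valid XML.
source = "<div><h1>heading</h1><!-- comment-with-hyphen --><p>some text</p></div>"

# The default parser does not add comments to the tree.
root = ET.fromstring(source)
cleaned = ET.tostring(root, encoding="unicode", method="html")

# An unclosed <br> is fine in HTML but not in XML, so parsing fails.
try:
    ET.fromstring("<p>unclosed <br> tag</p>")
    parsed_malformed = True
except ET.ParseError:
    parsed_malformed = False
```

If the input is ordinary HTML rather than XHTML, an HTML-aware parser is probably the safer choice.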
Iskren
1
re.sub("(?s)<!--.+?-->", "", s)

or

re.sub("<!--.+?-->", "", s, flags=re.DOTALL)
Dmitry Mottl
0

You could try this regex <![^<]*>
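A rough sketch of how this pattern behaves, using the string from the question plus two made-up inputs. It handles the question's example, but it also appears to match other <!...> constructs such as a doctype, and it misses comments that contain a < character.

```python
import re

# Pattern from the answer: "<!", then any run of characters except "<", then ">".
PATTERN = r"<![^<]*>"

question_text = ("<h1>heading</h1> <!-- comment-with-hyphen --> some text "
                 "<-- con --> more text <hello></hello> more text")
stripped = re.sub(PATTERN, "", question_text)

# Made-up inputs that show two limits of the pattern.
doctype_removed = re.sub(PATTERN, "", "<!DOCTYPE html><p>a</p>")  # doctype is removed too
comment_kept = re.sub(PATTERN, "", "<!-- a < b --><p>a</p>")      # comment is NOT removed
```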

dragon2fly