Python html2text regex performance

Question

I have built a sequence of regexes that converts HTML to plain text. I use it in up to 100 threads to clean up HTML files. I want to get all the visible text of a given HTML file.

    self.content = re.sub(r'<!--(.|\n)*?-->', '', self.content)
    self.content = re.sub(r'<script (.|\n)*?>(.|\n)*?</script>', '', self.content)
    self.content = re.sub(r'<style (.|\n)*?>(.|\n)*?</style>', '', self.content)
    self.content = re.sub(r'(<[^>]*?>+)', ' ', self.content)

I am not really a regex pro. Could I improve the performance of these regexes?

I don't want to use BeautifulSoup, Django, or the C++ distribution of html2text. In my tests they were slower than my regexes. I just need a space-separated string, not a tree, links, etc.

Thanks for helping. I know there are some really smart people on Stack Overflow.

What about `<script>`? What if the closing `</script>` is inside a JavaScript string? There's a reason people don't use regexes on HTML. And I'm obliged to link to this answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 - Thomas K, Mar 16 '11 at 21:08

score 4 · Answer 1 · answered Mar 16 '11 at 21:31

Use a tool like BeautifulSoup or htmllib and don't try to be smarter than the rest of the world. Parsing HTML using regular expressions is the worst thing you can do! There will always be one more HTML file where your regexes fail.
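
As an illustration of the parser-based approach, here is a minimal sketch that uses `html.parser` from the Python 3 standard library (htmllib exists only in Python 2). The names `VisibleTextParser` and `visible_text` are hypothetical, and the sketch assumes that "visible text" simply means every text node outside `<script>` and `<style>`. It has not been benchmarked against the regexes in the question.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style> (hypothetical helper)."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse all whitespace into single spaces.
    return ' '.join(' '.join(parser.parts).split())


print(visible_text('<p>Hello <b>world</b></p><script>var s = "</b>";</script>'))
```

Comments are dropped because the default `handle_comment` does nothing. Each call creates its own parser instance, so the function can be called from several threads without shared state.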

score 0 · Answer 2 · answered Mar 16 '11 at 23:51

There is a common credo according to which HTML and XML texts must ne-e-ever be treated with regex tools. You must take into account that the risks of such treatment are real and impossible to manage if the aims are too ambitious. HTML and XML are too complicated as markup languages to be analysed by regexes.

However, I don't totally share this common credo. In my opinion, it isn't such an absurd method if it is used lucidly, in conditions that can reasonably be considered to legitimate it because the risks seem minimal.

I believe that regexes can be used for limited and simple treatments of HTML or XML texts. I really understood here on stackoverflow.com that it is impracticable to parse HTML/XML with regexes. But when a treatment doesn't involve parsing (extracting all or part of a markup tree), why reject regexes so religiously (I allude to the cited link)?
It seems to me that a good safety measure is to restrict code that uses regex tools to texts from a constant origin, and not to try to make it analyse varied HTML or XML texts.

After these warnings, I dare to propose the following improvements to your REs:

    self.content = re.sub(r'<!--.*?-->', '', self.content, flags=re.DOTALL)

and

    self.content = re.sub(r'<(script|style) .*?\1>', '', self.content, flags=re.DOTALL)
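
A minimal sketch of how the two substitutions above might be combined with the final tag-stripping step from the question, assuming the patterns are compiled once at module level and reused by every thread. The function name `strip_html`, the constant names, and the sample input are hypothetical. Note two deliberate departures from the patterns above, both assumptions on my part: `\b` replaces the literal space so that a bare `<script>` is also matched, and the closing tag is spelled out as `</\1\s*>`. `re.DOTALL` lets `.` match newlines, which makes the `(.|\n)` alternation unnecessary.

```python
import re

# Compiled once at import time and reused for every document.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
# \b instead of a literal space also catches <script> without attributes;
# this differs from the pattern in the answer and is an assumption.
SCRIPT_STYLE_RE = re.compile(r'<(script|style)\b.*?</\1\s*>',
                             re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r'<[^>]*>')
WHITESPACE_RE = re.compile(r'\s+')


def strip_html(content):
    """Return the text of `content` as one space-separated string (hypothetical helper)."""
    content = COMMENT_RE.sub('', content)
    content = SCRIPT_STYLE_RE.sub('', content)
    content = TAG_RE.sub(' ', content)
    # Collapse the runs of spaces and newlines left behind by the removals.
    return WHITESPACE_RE.sub(' ', content).strip()


sample = """<html><head><style type="text/css">p { color: red; }</style>
<script>var x = 1;</script></head>
<body><!-- hidden --><p>Hello <b>world</b></p></body></html>"""
print(strip_html(sample))
```

The same caveats as before apply: this is only reasonable for HTML from a known, constant origin, and it will still fail on input such as a `>` inside an attribute value.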
