I have built an HTML-to-plain-text regex sequence. I use it in up to 100 threads to clean up HTML files. I want to get all the visible text of a given HTML file.
self.content = re.sub(r'<!--(.|\n)*?-->', '', self.content)
self.content = re.sub(r'<script (.|\n)*?>(.|\n)*?</script>', '', self.content)
self.content = re.sub(r'<style (.|\n)*?>(.|\n)*?</style>', '', self.content)
self.content = re.sub(r'(<[^>]*?>+)', ' ', self.content)
I am not really a regex pro. Maybe I could improve the performance of these regexes?
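One direction I have been considering is sketched below. It is only a rough sketch, not a tested replacement: the function name `html_to_text` and the module-level pattern names are made up for illustration. It differs from my code above in a few ways that are assumptions on my part: the patterns are compiled once and reused, `re.DOTALL` replaces the `(.|\n)` alternation, `<script>` and `<style>` tags are matched with or without attributes and regardless of case, and runs of whitespace are collapsed at the end.

```python
import re

# Compiled once at import time and reused for every document.
# re.DOTALL lets "." match newlines, so no (.|\n) alternation is needed.
_COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
_SCRIPT = re.compile(r'<script\b[^>]*>.*?</script\s*>', re.DOTALL | re.IGNORECASE)
_STYLE = re.compile(r'<style\b[^>]*>.*?</style\s*>', re.DOTALL | re.IGNORECASE)
_TAG = re.compile(r'<[^>]*>')
_SPACE = re.compile(r'\s+')


def html_to_text(html):
    """Return the text of an HTML string as one space-separated string.

    Hypothetical helper: same order of steps as the original sequence
    (comments, scripts, styles, remaining tags), plus whitespace collapsing.
    """
    text = _COMMENT.sub('', html)
    text = _SCRIPT.sub('', text)
    text = _STYLE.sub('', text)
    text = _TAG.sub(' ', text)
    # Collapse the runs of spaces and newlines left behind by removed tags.
    return _SPACE.sub(' ', text).strip()


if __name__ == '__main__':
    sample = ('<html><head><style>p{}</style><script>var a=1;</script></head>'
              '<body><!-- c --><p>Hello</p> <b>World</b></body></html>')
    print(html_to_text(sample))
```

I have not measured whether this is actually faster on my files, so it would need to be benchmarked against the original four `re.sub` calls.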
I don't want to use BeautifulSoup, Django, or the html2text C++ distribution. In my tests they were slower than my regex. I just need a space-separated string, not a tree, links, etc.
Thanks for helping. I know there are some really smart people on Stack Overflow.