Best way to 'clean up' html text

Question

I have the following text:

"It's the show your only friend and pastor have been talking about! 
<i>Wonder Showzen</i> is a hilarious glimpse into the black 
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth, 
nature, diversity, and history &#8211; all inside the prison of 
your mind! Where else can you..."

I want to remove the HTML tags and encode the text as Unicode. I am currently doing:

import re
TAG_RE = re.compile(r'<[^>]+>')  # assumed definition; the original pattern is not shown
def remove_tags(text):
    return TAG_RE.sub('', text)

This only strips the tags. How would I correctly encode the above for database storage?

Check this topic: http://stackoverflow.com/questions/23380171/using-beautifulsoup-extract-text-without-tags — Maksym Kozlenko, Aug 21 '15 at 03:33
Can you please explain what output you are expecting when you say "encode it to Unicode"? — Anand S Kumar, Aug 21 '15 at 03:36
By the way, what you're doing with your regexp is wrong. Do not do that. HTML cannot be parsed using regexp so all attempts to do so are bound to fail. Use an HTML parser instead, that's what they're for. — spectras, Aug 21 '15 at 03:59

mhawke · Accepted Answer · 2015-08-21T03:45:31.473

You could try passing your text through an HTML parser. Here is an example using BeautifulSoup:

from bs4 import BeautifulSoup

text = '''It's the show your only friend and pastor have been talking about! 
<i>Wonder Showzen</i> is a hilarious glimpse into the black 
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth, 
nature, diversity, and history &#8211; all inside the prison of 
your mind! Where else can you...'''

soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning

>>> soup.text
u"It's the show your only friend and pastor have been talking about! \nWonder Showzen is a hilarious glimpse into the black \nheart of childhood innocence! Get ready as the complete first season of MTV2's Wonder Showzen tackles valuable life lessons like birth, \nnature, diversity, and history \u2013 all inside the prison of \nyour mind! Where else can you..."

You now have a Unicode string in which the HTML entities have been converted to the corresponding Unicode characters, i.e. &#8211; was converted to \u2013 (an en dash, shown here in its escaped form).

This also removes the HTML tags.
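
If installing BeautifulSoup is not an option, roughly the same result can be obtained with the standard library alone. The following is a minimal sketch, assuming Python 3; TextExtractor and strip_html are hypothetical names, not part of any library, and the final encode step assumes the database driver expects UTF-8 bytes (many drivers accept the Unicode string directly).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#8211;
        # before the text is passed to handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_html(markup):
    """Return the text of markup with tags removed and entities decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = "MTV2's<i> Wonder Showzen</i> tackles history &#8211; all inside"
clean = strip_html(sample)

# Only needed if the storage layer wants bytes rather than a str.
utf8_bytes = clean.encode("utf-8")
print(clean)
```

Like the BeautifulSoup version, this keeps the whitespace and newlines of the original text, so any normalisation would be a separate step.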

See here: http://stackoverflow.com/questions/275174/how-do-i-perform-html-decoding-encoding-using-python-django — dsgdfg, Aug 21 '15 at 05:01
