Python HTML removal

Question

How can I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

blah blah link

Thanks!

Might be overkill for your purposes, but give BeautifulSoup a try if your strings have more complicated or malformed HTML. Caveat: I don't think it's available for Python 3.0 yet. — mechanical_meat, Feb 28 '09 at 22:51

score 18 · Answer 1 · answered Mar 01 '09 at 02:00

When your regular expression solution hits a wall, try this super easy (and reliable) BeautifulSoup program.

# Beautiful Soup 3 import (Python 2). With Beautiful Soup 4 this is: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

html = "<a> Keep me </a>"
soup = BeautifulSoup(html)

text_parts = soup.findAll(text=True)
text = ''.join(text_parts)

Kenan Banks


BeautifulSoup hits the same wall too. See http://stackoverflow.com/questions/598817/python-html-removal/600471#600471 – jfs Mar 01 '09 at 20:46

[Beautiful Soup *4*](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) works e.g., `soup.get_text()` – jfs Apr 10 '18 at 08:07

score 10 · Answer 2 · answered Mar 01 '09 at 14:45

There is also a small library called stripogram which can be used to strip away some or all HTML tags.

You can use it like this:

from stripogram import html2text, html2safehtml
# Only allow <b>, <a>, <i>, <br>, and <p> tags
clean_html = html2safehtml(original_html,valid_tags=("b", "a", "i", "br", "p"))
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces 
# and a page that's 80 characters wide.
text = html2text(original_html,ignore_tags=("img",),indent_width=4,page_width=80)

So if you want to simply strip out all HTML, you pass valid_tags=() to the first function.

You can find the documentation here.

score 7 · Accepted Answer · answered Feb 28 '09 at 22:43

You can use a regular expression to remove all the tags:

>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> re.sub('<[^>]*>', '', s)
'blah blah link'

Luke Woodward


You can simplify your regex to '<.*?>', which will accomplish the same results, but this assumes properly formatted HTML, as does yours. – UnkwnTech Feb 28 '09 at 22:45
Do you have to check for quoted >, or are those not allowed? Can you have or something? – Daniel LeCheminant Feb 28 '09 at 22:45
@Unkwntech: I prefer <[^>]*> over <.*?> since the former does not need to keep backtracking to find the end of the tag. – Luke Woodward Feb 28 '09 at 22:50
@Daniel L: Ideally, >s in attributes should be replaced with &gt;. It is possible to modify the above regexp to ignore >s in attributes, but I'll leave that as an exercise for the interested reader. – Luke Woodward Feb 28 '09 at 23:02

That's not going to work well with things like "line1<br>line2", newlines or double spaces etc. It also won't decode HTML entities. Quick and dirty might be good enough, but to really do this right you're going to need to use a real HTML library like BeautifulSoup or lxml. – gerdemb Mar 01 '09 at 01:35
Why not r'<[^>]+>'? There is no '<>' tag in HTML. – jfs Mar 01 '09 at 20:45
@J.F. Sebastian: I don't see that it makes a difference worth worrying about. – Luke Woodward Mar 01 '09 at 21:27
it may fail on valid html. [Is ">" (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?](https://stackoverflow.com/q/94528/4279) – jfs Apr 10 '18 at 08:00
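
To make the caveats raised in these comments concrete, here is a minimal Python 3 sketch of the quick-and-dirty route. It assumes the accepted answer's regex plus two standard-library steps that are not in the original answer: `html.unescape` to decode entities and a whitespace collapse. The function name `strip_tags_quick` is made up for illustration.

```python
import html
import re

# Same pattern as the accepted answer: anything from '<' up to the next '>'.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags_quick(s):
    """Drop tags, decode entities, then collapse runs of whitespace."""
    no_tags = TAG_RE.sub("", s)
    decoded = html.unescape(no_tags)   # e.g. '&amp;' becomes '&'
    return " ".join(decoded.split())

print(strip_tags_quick('blah blah <a href="blah">link</a> &amp; more'))
# prints: blah blah link & more

# The limitation discussed above: a '>' inside an attribute value ends the
# match too early, so part of the tag leaks into the output.
print(strip_tags_quick('x <a title="a>b">y</a>'))
# prints: x b">y
```

Note that `<br>` is simply deleted, so "line1<br>line2" comes out as "line1line2". For anything beyond simple input, a real parser is the safer choice.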

score 5 · Answer 4 · edited May 23 '17 at 11:45

Regexes, BeautifulSoup, and html2text don't work if an attribute has '>' in it. See Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?

An HTML/XML parser-based solution might help in such cases; e.g., stripogram, suggested by @MrTopf, does work.

Here's an ElementTree-based solution:

# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree

str_ = 'blah blah <a href="blah">link</a> END'
root = etree.fromstring('<html>%s</html>' % str_)
print(''.join(root.itertext()))  # lxml or ElementTree 1.3+

Output:

blah blah link END
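
For Python 3 with only the standard library, the same idea might look like the sketch below. The helper name `text_from_fragment` is made up for illustration, and the approach assumes the input is well-formed XML: an unclosed tag such as `<br>` raises `ParseError` instead of being stripped.

```python
from xml.etree import ElementTree as etree  # standard library

def text_from_fragment(fragment):
    """Wrap the fragment in a root element and join all of its text nodes."""
    root = etree.fromstring("<html>%s</html>" % fragment)
    return "".join(root.itertext())

print(text_from_fragment('blah blah <a href="blah">link</a> END'))
# prints: blah blah link END

# A '>' inside a quoted attribute value is handled correctly by the parser.
print(text_from_fragment('x <a title="a>b">y</a>'))
# prints: x y

# Input that is not well-formed XML is rejected rather than cleaned up.
try:
    text_from_fragment("line1<br>line2")
except etree.ParseError as exc:
    print("not well-formed:", exc)
```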

score 3 · Answer 5 · answered Feb 28 '09 at 22:52

Try Beautiful Soup. Throw away everything except the text.

George V. Reilly


score 2 · Answer 6 · answered Mar 01 '09 at 18:38

html2text will do something like this.

RexE


html2text is great for producing nicely formatted, readable output without an extra step. If all the HTML strings you need to convert are as simple as your example, then BeautifulSoup is the way to go. If more complex, html2text does a great job of preserving the readable intent of the original. – Jarret Hardie Mar 01 '09 at 21:20

score 1 · Answer 7 · edited Jun 29 '12 at 21:14

I just wrote this. I need it. It uses html2text and takes a file path, although I would prefer a URL. The output of html2text is stored in TextFromHtml2Text.text: print it, store it, feed it to your pet canary.

import html2text
class TextFromHtml2Text:

    def __init__(self, url = ''):
        if url == '':
            raise TypeError("Needs a URL")
        self.text = ""
        self.url = url
        self.html = ""
        self.gethtmlfile()
        self.maytheswartzbewithyou()

    def gethtmlfile(self):
        # Despite the name, self.url is opened as a local file path.
        with open(self.url) as f:
            self.html = f.read()

    def maytheswartzbewithyou(self):
        self.text = html2text.html2text(self.html)

You could also just write this as `import urllib, html2text[break]def get_text_from_html_url(url):[break] return html2text.html2text(urllib.urlopen(url).read())` shorter and cleaner — Jordan Reiter, Jun 29 '12 at 21:20

score 1 · Answer 8 · answered Jan 22 '13 at 17:31

There's a simple way to do this:

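def remove_html_markup(s):
    tag = False    # True while inside <...>
    quote = None   # quote character that opened the current attribute value, if any
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif c in ('"', "'") and tag:
            # Only the matching quote character closes a quoted value,
            # so title="it's" does not end at the apostrophe.
            if quote is None:
                quote = c
            elif quote == c:
                quote = None
        elif not tag:
            out = out + c

    return out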

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS: If you're interested in the class (about smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)
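
The character-by-character scan above is in effect a small hand-written tokenizer. As an alternative, here is a sketch using `html.parser.HTMLParser` from the Python 3 standard library, which tokenizes tags (including quoted attribute values) and decodes entities. The names `TextExtractor` and `strip_tags` are made up for illustration and are not from the original answer.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes; tags and attributes are ignored."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(s):
    parser = TextExtractor()
    parser.feed(s)
    parser.close()  # flush any text still buffered at the end of input
    return "".join(parser.parts)

print(strip_tags('blah blah <a href="blah">link</a>'))
# prints: blah blah link
```

Unlike an XML parser, this tolerates unclosed tags such as `<br>`. It also passes the contents of `<script>` and `<style>` elements to `handle_data`, so those would need to be filtered separately if they can occur in the input.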

score 0 · Answer 9 · answered Feb 28 '09 at 23:23

>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> q = re.compile(r'<.*?>', re.IGNORECASE)
>>> re.sub(q, '', s)
'blah blah link'

riza

