7

How can I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

blah blah link

Thanks!

user29772
  • Might be overkill for your purposes, but give BeautifulSoup a try if your strings have more complicated or malformed HTML. Caveat: I don't think it's available for Python 3.0 yet. – mechanical_meat Feb 28 '09 at 22:51

9 Answers

18

When your regular expression solution hits a wall, try this super easy (and reliable) BeautifulSoup program.

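from bs4 import BeautifulSoup  # Beautiful Soup 4: pip install beautifulsoup4

html = "<a> Keep me </a>"
soup = BeautifulSoup(html, "html.parser")

# Collect every text node and join them, dropping all tags.
text_parts = soup.find_all(string=True)
text = ''.join(text_parts)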
Kenan Banks
  • BeautifulSoup hits the same wall too. See http://stackoverflow.com/questions/598817/python-html-removal/600471#600471 – jfs Mar 01 '09 at 20:46
  • [Beautiful Soup *4*](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) works, e.g., `soup.get_text()` – jfs Apr 10 '18 at 08:07
10

There is also a small library called stripogram which can be used to strip away some or all HTML tags.

You can use it like this:

from stripogram import html2text, html2safehtml
# Only allow <b>, <a>, <i>, <br>, and <p> tags
clean_html = html2safehtml(original_html,valid_tags=("b", "a", "i", "br", "p"))
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces 
# and a page that's 80 characters wide.
text = html2text(original_html,ignore_tags=("img",),indent_width=4,page_width=80)

So if you want to simply strip out all HTML, you pass valid_tags=() to the first function.

You can find the documentation here.

MrTopf
7

You can use a regular expression to remove all the tags:

>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> re.sub('<[^>]*>', '', s)
'blah blah link'
Luke Woodward
  • You can simplify your regex to '<.*?>', which will accomplish the same result, but this assumes properly formatted HTML, as does yours. – UnkwnTech Feb 28 '09 at 22:45
  • Do you have to check for quoted >, or are those not allowed? Can you have or something? – Daniel LeCheminant Feb 28 '09 at 22:45
  • @Unkwntech: I prefer <[^>]*> over <.*?> since the former does not need to keep backtracking to find the end of the tag. – Luke Woodward Feb 28 '09 at 22:50
  • @Daniel L: Ideally, >s in attributes should be replaced with &gt;. It is possible to modify the above regexp to ignore >s in attributes, but I'll leave that as an exercise for the interested reader. – Luke Woodward Feb 28 '09 at 23:02
  • That's not going to work well with things like "line1<br>line2", newlines, double spaces, etc. It also won't decode HTML entities. Quick and dirty might be good enough, but to really do this right you're going to need to use a real HTML library like BeautifulSoup or lxml. – gerdemb Mar 01 '09 at 01:35
  • Why not r'<[^>]+>'? There is no '<>' tag in HTML. – jfs Mar 01 '09 at 20:45
  • @J.F. Sebastian: I don't see that it makes a difference worth worrying about. – Luke Woodward Mar 01 '09 at 21:27
  • it may fail on valid html. [Is ">" (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?](https://stackoverflow.com/q/94528/4279) – jfs Apr 10 '18 at 08:00
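The two objections raised in the comments above (a '>' inside a quoted attribute value, and undecoded HTML entities) can be handled in a regex-only approach roughly as follows. This is a minimal sketch, not the answerer's code: the names `TAG_RE` and `strip_tags` are made up for illustration, and it still does nothing special for comments, `<script>` bodies, or badly malformed markup.

```python
import html
import re

# Hypothetical helper names, for illustration only.
# A tag is '<', then any run of: a character that is not '>' or a quote,
# or a complete double-quoted string, or a complete single-quoted string,
# then the closing '>'. A '>' inside quotes therefore does not end the tag.
TAG_RE = re.compile(r'''<(?:[^>"']|"[^"]*"|'[^']*')*>''')


def strip_tags(s):
    """Remove tags, then decode entities such as &amp; and &gt;."""
    return html.unescape(TAG_RE.sub('', s))


print(strip_tags('blah blah <a href="blah">link</a>'))
print(strip_tags('<a title=">">x</a> &amp; y'))
```

Decoding entities after removing tags (rather than before) avoids turning an escaped `&lt;b&gt;` in the text into something that looks like a tag.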
5

Regexes, BeautifulSoup, and html2text don't work if an attribute has '>' in it. See "Is ">" (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?"

An HTML/XML parser-based solution might help in such cases; e.g., stripogram, suggested by @MrTopf, does work.

Here's an ElementTree-based solution:

# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree

str_ = 'blah blah <a href="blah">link</a> END'
# Wrap the fragment in a root element so it parses as a single XML tree.
root = etree.fromstring('<html>%s</html>' % str_)
print(''.join(root.itertext()))  # lxml or ElementTree 1.3+

Output:

blah blah link END
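
The ElementTree approach above needs well-formed XML. For HTML that is not well formed, a similar parser-based idea can be sketched with the standard library's `html.parser` module in Python 3. The class and function names below (`TextExtractor`, `strip_html`) are made up for illustration, and note that this simple version also keeps the text inside `<script>` and `<style>` elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes the parser reports; tags are ignored."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(s):
    parser = TextExtractor()
    parser.feed(s)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(strip_html('blah blah <a href="blah">link</a> END'))
```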
jfs
3

Try Beautiful Soup. Throw away everything except the text.

George V. Reilly
2

html2text will do something like this.

RexE
  • html2text is great for producing nicely formatted, readable output without an extra step. If all the HTML strings you need to convert are as simple as your example, then BeautifulSoup is the way to go. If more complex, html2text does a great job of preserving the readable intent of the original. – Jarret Hardie Mar 01 '09 at 21:20
1

I just wrote this because I needed it. It uses html2text and takes a file path, although I would prefer a URL. The output of html2text is stored in TextFromHtml2Text.text: print it, store it, feed it to your pet canary.

import html2text


class TextFromHtml2Text:

    def __init__(self, url=''):
        if url == '':
            raise TypeError("Needs a URL")
        self.text = ""
        self.url = url  # despite the name, this is a local file path
        self.html = ""
        self.gethtmlfile()
        self.maytheswartzbewithyou()

    def gethtmlfile(self):
        # Read the whole file at once and close it when done.
        with open(self.url) as f:
            self.html = f.read()

    def maytheswartzbewithyou(self):
        self.text = html2text.html2text(self.html)
Jordan Reiter
  • You could also just write this as `import urllib, html2text[break]def get_text_from_html_url(url):[break] return html2text.html2text(urllib.urlopen(url).read())` shorter and cleaner – Jordan Reiter Jun 29 '12 at 21:20
1

There's a simple way to do this:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), here is the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)
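
One case the single boolean `quote` flag above does not cover is an attribute value that contains the other kind of quote, such as `title="it's > here"`: the apostrophe flips the flag back and the tag ends too early. A possible variation, sketched here as an assumption rather than as the code from the linked videos, remembers which quote character opened the value:

```python
def remove_html_markup(s):
    in_tag = False
    quote_char = None  # quote character that opened the current attribute value
    out = []

    for c in s:
        if quote_char is not None:
            # Inside a quoted attribute value: only the same quote closes it.
            if c == quote_char:
                quote_char = None
        elif in_tag:
            if c == '>':
                in_tag = False
            elif c in '"\'':
                quote_char = c
        elif c == '<':
            in_tag = True
        else:
            out.append(c)

    return ''.join(out)


print(remove_html_markup('<a title="it\'s > here">x</a>'))
```

Quotes that appear in ordinary text outside a tag are copied through unchanged.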

Igor Medeiros
0
>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> q = re.compile(r'<.*?>')
>>> q.sub('', s)
'blah blah link'
riza