60

I'm trying to convert a chunk of HTML to plain text with BeautifulSoup. Here is an example:

<div>
    <p>
        Some text
        <span>more text</span>
        even more text
    </p>
    <ul>
        <li>list item</li>
        <li>yet another list item</li>
    </ul>
</div>
<p>Some other text</p>
<ul>
    <li>list item</li>
    <li>yet another list item</li>
</ul>

I tried doing something like:

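import re
import BeautifulSoup  # BeautifulSoup 3 (Python 2)

def parse_text(contents_string):
    # Collapse a line break plus any following indentation into a single newline
    newlines = re.compile(r'[\r\n]\s+')
    bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    txt = bs.getText('\n')
    return newlines.sub('\n', txt)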

...but that way my span element always ends up on a new line. This is of course a simple example. Is there a way in Python to get the text of an HTML page the way a browser would render it (no CSS rules required, just the default way div, span, li, etc. elements are rendered)?

btatarov
  • Show us what the expected output looks like? You want to strip all the indenting whitespace, and newlines, right? – smci Dec 29 '18 at 07:54

2 Answers

127

BeautifulSoup is a scraping library, so it's probably not the best choice for doing HTML rendering. If it's not essential to use BeautifulSoup, you should take a look at html2text. For example:

import html2text

# Read the HTML file and convert it to plain (Markdown-style) text
with open("foobar.html") as f:
    html = f.read()
print(html2text.html2text(html))

This outputs:

Some text more text even more text

  * list item
  * yet another list item

Some other text

  * list item
  * yet another list item
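If installing html2text is not an option, or the Markdown markers are unwanted, the same idea can be sketched with only the standard library. This is a rough approximation, not how a browser or html2text actually works: `BlockTextExtractor`, `render_text`, and the `BLOCK_TAGS` set are names made up for this example, and the list of block-level tags is a small hand-picked assumption. Inline elements such as span stay on the current line, while block-level elements start a new one.

```python
import re
from html.parser import HTMLParser

# Assumption: a small, hand-picked set of block-level tags; browsers know many more.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "br", "h1", "h2", "h3",
              "h4", "h5", "h6", "table", "tr", "blockquote", "pre"}


class BlockTextExtractor(HTMLParser):
    """Collect text, breaking the line only at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Whitespace in the source (including line breaks) collapses to one space,
        # so only block-level tags can introduce a newline.
        self.parts.append(re.sub(r"\s+", " ", data))


def render_text(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    lines = []
    for line in "".join(parser.parts).split("\n"):
        line = re.sub(r" +", " ", line).strip()
        if line:  # drop lines that were only indentation
            lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "<div>\n  <p>\n    Some text\n    <span>more text</span>\n    even more text\n  </p>\n</div>"
    print(render_text(sample))
```

This sketch ignores list bullets, tables, script and style contents, and CSS entirely, so it is only a starting point for simple markup like the example in the question.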
JosefAssad
del
  • 2
    Can I use html2text in conjunction with BeautifulSoup? For example, I parse the chunk of HTML I'm interested in and then feed it to html2text using prettify()? – btatarov Nov 12 '12 at 03:58
  • 2
    Yes, html2text can process HTML in chunks by calling `HTML2Text.feed(chunk)` on each successive chunk, and then calling `HTML2Text.close()` to get the text result (similar to [`HTMLParser.feed()`](http://docs.python.org/2/library/htmlparser.html#HTMLParser.HTMLParser.feed)). – del Nov 12 '12 at 04:40
  • 40
    This answer made me happy and sad at the same time. RIP Aaron Swartz. – Steve Rossiter Jan 16 '16 at 00:37
  • 5
    Remember to check whether `html2text` complies with your licensing policy as it is distributed under *GPLv3*. – Pawel Kam Jul 04 '20 at 21:19
  • 1
    html2text converts the HTML string to a Markdown string, so the library may not meet everyone's needs. Some people, such as me, may not want Markdown markup to appear in the result. – vipcxj Feb 03 '21 at 07:43
5

I ran into the same problem trying to extract the rendered text from HTML. It seems that BeautifulSoup is not the ideal package for this. @del gives the great html2text solution.

On a different SO question, "BeautifulSoup get_text does not strip all tags and JavaScript", @Helge mentioned using nltk. Unfortunately, nltk appears to be discontinuing this method.

I tried both html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data...

Answer from @Helge (nltk).

import nltk

%timeit nltk.clean_html(html)
This returned 153 us per loop.

It worked really well, returning a string of the rendered HTML text. On my data this nltk function was roughly 20 times faster than html2text (153 us vs. 3.09 ms per loop), though html2text is perhaps more robust.

Answer above from @del

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
3.09 ms per loop
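`%timeit` is an IPython magic, so these measurements only reproduce inside IPython or Jupyter. A rough equivalent for a plain script can be sketched with the standard `timeit` module. In the sketch below, `strip_tags_naive` is a hypothetical stand-in for whichever converter is being measured (for example `html2text.html2text`), `time_per_call` is a helper invented for this example, and the sample HTML is made up; absolute numbers will depend heavily on the input.

```python
import re
import timeit

# Made-up sample input; replace with the HTML you actually want to convert.
SAMPLE_HTML = "<div><p>Some text <span>more text</span></p></div>" * 100


def strip_tags_naive(html):
    # Hypothetical stand-in for the function under test.
    return re.sub(r"<[^>]+>", "", html)


def time_per_call(func, arg, number=200, repeat=3):
    """Return the best average time in seconds per call over `repeat` runs."""
    runs = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(runs) / number


if __name__ == "__main__":
    seconds = time_per_call(strip_tags_naive, SAMPLE_HTML)
    print(f"{seconds * 1e6:.1f} us per loop")
```

Taking the minimum over several repeats mirrors what `%timeit` reported in older IPython versions and reduces noise from other processes.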
Community
Paul
  • 12
    nltk.clean_html gives `NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function` – Martin Thoma Jan 20 '15 at 08:26
  • 2
    Even if you happen to have an old version of nltk, don't use this function. It's fast because it processes html with regexes: https://github.com/nltk/nltk/blob/e86e83b1e2219fb099c4fbcff89a4ae07cd14868/nltk/util.py#L333-L353 – digenishjkl Jan 12 '16 at 13:05
  • 1
    I added an answer on a related question which gives a way to strip JavaScript via BeautifulSoup: https://stackoverflow.com/a/47782943/2112722 – Sarah Messer Dec 12 '17 at 23:36
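The warning in the comments about regex-based tag stripping can be illustrated with a small sketch. The pattern below is a deliberately naive one chosen for illustration, not nltk's actual implementation, and `strip_tags_regex`, `strip_tags_parser`, and `_TextCollector` are names invented for this example. It compares the regex against the standard library's `html.parser` on an attribute value that contains a `>` character.

```python
import re
from html.parser import HTMLParser


def strip_tags_regex(html):
    # Naive pattern for illustration only; not nltk's actual implementation.
    return re.sub(r"<[^>]+>", "", html)


class _TextCollector(HTMLParser):
    """Collect only the text nodes that a real tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_parser(html):
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


if __name__ == "__main__":
    tricky = '<a title="1 > 0">link</a>'
    print(repr(strip_tags_regex(tricky)))
    print(repr(strip_tags_parser(tricky)))
```

On this input the regex stops at the `>` inside the quoted attribute and leaves attribute debris in the result, while the parser-based version returns only `link`.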