Use html5lib to convert an HTML fragment to plain text

Question

Is there an easy way to use the Python library html5lib to convert something like this:

<p>Hello World. Greetings from <strong>Mars.</strong></p>

to

Hello World. Greetings from Mars.

If you are not stuck with poorly documented html5lib, http://stackoverflow.com/questions/2558056/how-can-i-parse-html-with-html5lib-and-query-the-parsed-html-with-xpath will help — Wolfgang Kuehn, Dec 31 '11 at 00:31

Niklas B. · Accepted Answer · 2011-12-31T02:01:21.167

score 12

With lxml as the parser backend:

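import html5lib

body = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"
# With the lxml tree builder, parse() returns an lxml ElementTree,
# which has no text_content(); join the root element's text nodes instead.
doc = html5lib.parse(body, treebuilder="lxml")
print("".join(doc.getroot().itertext()))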

To be honest, this is actually cheating, as it is equivalent to the following (only the relevant parts are changed):

from lxml import html
doc = html.fromstring(body)
print(doc.text_content())

If you really want the html5lib parsing engine:

from lxml.html import html5parser
doc = html5parser.fromstring(body)
print(doc.xpath("string()"))

edited Dec 31 '11 at 02:01

answered Dec 31 '11 at 00:37

Niklas B.


Looks like you can call doc.text_content() to also accomplish this. – Jason Christa Dec 31 '11 at 00:59
@Niklas you can write that a shorter way without the join by just doing `doc.xpath('string()')`. Also, as a side-note, that is essentially what the `lxml.html.HtmlMixin` class does for the call to `text_content()` that @JasonChrista mentioned. – aculich Dec 31 '11 at 01:54
@aculich: Thanks for the information. Could come in handy some time :) I'm updating the question. – Niklas B. Dec 31 '11 at 01:55
@JasonChrista note that `text_content()` will only work in the case of `lxml.html`, but not for `lxml.html.html5parser`. I'm not sure if it is a bug or not, but the latter does not use `lxml.html.HtmlMixin` where `text_content()` is defined. Compare these two: `lxml.html.fromstring('<p>foo</p>').text_content()` versus `lxml.html.html5parser.fromstring('<p>foo</p>').text_content()` – aculich Dec 31 '11 at 02:00
The html5lib one doesn't actually work, as aculich says. It also doesn't handle adding whitespace, like converting "a<br>b<p>c" to "a\nb\n\nc". – Glenn Maynard Apr 16 '14 at 15:49

seddonym · Answer 2 · 2013-11-11T11:09:11.313

score 4

I use html2text, which converts HTML to plain text (formatted as Markdown).

from html2text import HTML2Text
handler = HTML2Text()

html = """Lorem <i>ipsum</i> dolor sit amet, <b>consectetur adipiscing</b> elit.<br>
          <br><h1>Nullam eget \r\ngravida elit</h1>Integer iaculis elit at risus feugiat:
          <br><br><ul><li>Egestas non quis \r\nlorem.</li><li>Nam id lobortis felis.
          </li><li>Sed tincidunt nulla.</li></ul>
          At massa tempus, quis \r\nvehicula odio laoreet.<br>"""
text = handler.handle(html)

>>> text
u'Lorem _ipsum_ dolor sit amet, **consectetur adipiscing** elit.\n\n  \n\n# Nullam eget gravida elit\n\nInteger iaculis elit at risus feugiat:\n\n  \n\n  * Egestas non quis lorem.\n  * Nam id lobortis felis.\n  * Sed tincidunt nulla.\nAt massa tempus, quis vehicula odio laoreet.\n\n'

edited Nov 11 '13 at 11:09

answered Nov 11 '13 at 10:58

seddonym


Just tried it again, still working for me. What's the issue? – seddonym May 17 '14 at 12:51
thanks for followup.. this is what i got.. `Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32 Type "copyright", "credits" or "license()" for more information. >>> ================================ RESTART ================================ >>> Traceback (most recent call last): File "D:/test/scraping/test.py", line 1, in from html2text import HTML2Text File "D:/test/scraping\html2text.py", line 5, in print doc.text_content() AttributeError: 'lxml.etree._ElementTree' object has no attribute 'text_content' >>> ` – ihightower May 17 '14 at 18:03
unfortunately.. the other code in this page didn't work too.. with same or similar error. sorry! i don't know how to fix the fault.. or to put back the vote down in its original place. – ihightower May 17 '14 at 18:06
Oh dear, don't worry. I think this issue is probably a bug that's not really related to this question. You could consider reinstalling the latest versions of all the libraries (or try it in a fresh virtualenv), otherwise maybe the issue belongs as a separate question. Good luck! – seddonym May 19 '14 at 11:38
it works now.. i re-installed python2.7. i noticed that some of the library .py files are changed by me accidentally.. and saved automatically (using pyscripter debug sessions). after re-install i have fresh set.. and the code works as expected. I really wish i can void the -1.. can you help me? – ihightower May 20 '14 at 13:09
why is this downvoted? using html2text is a perfectly good suggestion. – franklin Oct 16 '14 at 17:13
It was a mistake on the part of a commenter above - feel free to vote me up if you like :) – seddonym Oct 17 '14 at 07:59

score 1 · Answer 3 · answered Apr 19 '17 at 16:34

You can concatenate the result of the itertext() method.

Example:

import html5lib
d = html5lib.parseFragment(
        '<p>Hello World. Greetings from <strong>Mars.</strong></p>')
s = ''.join(d.itertext())
print(s)

Output:

Hello World. Greetings from Mars.
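
A plain join of itertext() drops block boundaries, which is the whitespace problem raised in a comment on the accepted answer. Below is a minimal sketch of one way to keep them. The names text_with_breaks, local_name and BLOCK_TAGS are made up for this example, and the set of block tags is an assumption rather than anything html5lib defines. The demo builds its tree with the standard library so that it runs without extra packages; a tree from html5lib.parseFragment() should behave the same way, since html5lib's default tree builder produces xml.etree.ElementTree elements.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a small, illustrative set of block-level tags; extend as needed.
BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


def local_name(tag):
    """Return the lowercase tag name without a "{namespace}" prefix."""
    return tag.rsplit("}", 1)[-1].lower()


def text_with_breaks(element):
    """Collect text from an ElementTree-style tree, keeping line breaks.

    <br> becomes one newline; block elements are set off by blank lines.
    """
    parts = []

    def walk(node):
        if not isinstance(node.tag, str):
            return  # comment or processing instruction: skip its text
        name = local_name(node.tag)
        if name == "br":
            parts.append("\n")
        elif name in BLOCK_TAGS:
            parts.append("\n\n")
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:
                parts.append(child.tail)
        if name in BLOCK_TAGS:
            parts.append("\n\n")

    walk(element)
    # Collapse runs of three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "".join(parts)).strip()


tree = ET.fromstring("<div>a<br/>b<p>c</p></div>")
print(repr(text_with_breaks(tree)))  # 'a\nb\n\nc'
```

With html5lib installed, text_with_breaks(html5lib.parseFragment("a<br>b<p>c")) should give the same text. The sketch does not collapse whitespace that is already in the HTML source, so indented markup will keep its indentation.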
