7

Is there an easy way to use the Python library html5lib to convert something like this:

<p>Hello World. Greetings from <strong>Mars.</strong></p>

to

Hello World. Greetings from Mars.
Jason Christa
  • 1
    If you are not stuck with poorly documented html5lib, http://stackoverflow.com/questions/2558056/how-can-i-parse-html-with-html5lib-and-query-the-parsed-html-with-xpath will help – Wolfgang Kuehn Dec 31 '11 at 00:31

3 Answers

12

With lxml as the parser backend:

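import html5lib

body = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"
# With the lxml tree builder, parse() returns an lxml ElementTree, which has
# no text_content(); take the string value of the root element instead.
doc = html5lib.parse(body, treebuilder="lxml")
print(doc.getroot().xpath("string()"))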

To be honest, this is cheating a little, since lxml on its own gives the same result here (only the relevant parts are changed):

from lxml import html
doc = html.fromstring(body)
print(doc.text_content())

If you really want the html5lib parsing engine:

from lxml.html import html5parser
doc = html5parser.fromstring(body)
print(doc.xpath("string()"))
Niklas B.
  • Looks like you can call doc.text_content() to also accomplish this. – Jason Christa Dec 31 '11 at 00:59
  • 1
    @Niklas you can write that a shorter way without the join by just doing `doc.xpath('string()')`. Also, as a side-note, that is essentially what the `lxml.html.HtmlMixin` class does for the call to `text_content()` that @JasonChrista mentioned. – aculich Dec 31 '11 at 01:54
  • @aculich: Thanks for the information. Could come in handy some time :) I'm updating the question. – Niklas B. Dec 31 '11 at 01:55
  • 1
    @JasonChrista note that `text_content()` will only work in the case of `lxml.html`, but not for `lxml.html.html5parser`. I'm not sure if it is a bug or not, but the latter does not use `lxml.html.HtmlMixin` where `text_content()` is defined. Compare these two: `lxml.html.fromstring('<p>foo</p>').text_content()` versus `lxml.html.html5parser.fromstring('<p>foo</p>').text_content()` – aculich Dec 31 '11 at 02:00
  • The html5lib one doesn't actually work, as aculich says. It also doesn't handle adding whitespace, like converting "a
    b

    c" to "a\nb\n\nc".

    – Glenn Maynard Apr 16 '14 at 15:49
4

I use html2text, which converts it to plain text (in Markdown format).

from html2text import HTML2Text
handler = HTML2Text()

html = """Lorem <i>ipsum</i> dolor sit amet, <b>consectetur adipiscing</b> elit.<br>
          <br><h1>Nullam eget \r\ngravida elit</h1>Integer iaculis elit at risus feugiat:
          <br><br><ul><li>Egestas non quis \r\nlorem.</li><li>Nam id lobortis felis.
          </li><li>Sed tincidunt nulla.</li></ul>
          At massa tempus, quis \r\nvehicula odio laoreet.<br>"""
text = handler.handle(html)

>>> text
u'Lorem _ipsum_ dolor sit amet, **consectetur adipiscing** elit.\n\n  \n\n# Nullam eget gravida elit\n\nInteger iaculis elit at risus feugiat:\n\n  \n\n  * Egestas non quis lorem.\n  * Nam id lobortis felis.\n  * Sed tincidunt nulla.\nAt massa tempus, quis vehicula odio laoreet.\n\n'
seddonym
  • Just tried it again, still working for me. What's the issue? – seddonym May 17 '14 at 12:51
  • thanks for followup.. this is what i got.. `Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32 Type "copyright", "credits" or "license()" for more information. >>> ================================ RESTART ================================ >>> Traceback (most recent call last): File "D:/test/scraping/test.py", line 1, in from html2text import HTML2Text File "D:/test/scraping\html2text.py", line 5, in print doc.text_content() AttributeError: 'lxml.etree._ElementTree' object has no attribute 'text_content' >>> ` – ihightower May 17 '14 at 18:03
  • unfortunately.. the other code in this page didn't work too.. with same or similar error. sorry! i don't know how to fix the fault.. or to put back the vote down in its original place. – ihightower May 17 '14 at 18:06
  • 1
    Oh dear, don't worry. I think this issue is probably a bug that's not really related to this question. You could consider reinstalling the latest versions of all the libraries (or try it in a fresh virtualenv), otherwise maybe the issue belongs as a separate question. Good luck! – seddonym May 19 '14 at 11:38
  • it works now.. i re-installed python2.7. i noticed that some of the library .py files are changed by me accidentally.. and saved automatically (using pyscripter debug sessions). after re-install i have fresh set.. and the code works as expected. I really wish i can void the -1.. can you help me? – ihightower May 20 '14 at 13:09
  • 1
    why is this downvoted? using html2text is a perfectly good suggestion. – franklin Oct 16 '14 at 17:13
  • It was a mistake on the part of a commenter above - feel free to vote me up if you like :) – seddonym Oct 17 '14 at 07:59
1

You can concatenate the result of the itertext() method.

Example:

import html5lib
d = html5lib.parseFragment(
        '<p>Hello World. Greetings from <strong>Mars.</strong></p>')
s = ''.join(d.itertext())
print(s)

Output:

Hello World. Greetings from Mars.
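
Note that `itertext()` joins the text exactly as it appears in the tree, so paragraph and line breaks are lost. Below is a rough sketch of one way to keep them. The `BLOCK_TAGS` set is hand-picked and the helper `fragment_to_text` is hypothetical; neither comes from html5lib. Passing `namespaceHTMLElements=False` keeps the tag names unprefixed so they can be compared directly.

```python
import html5lib

# Hand-picked, deliberately incomplete set of tags that should end a line.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6", "tr"}


def fragment_to_text(markup):
    """Return the text of an HTML fragment, ending a line at each block tag."""
    root = html5lib.parseFragment(markup, namespaceHTMLElements=False)
    parts = []

    def walk(element):
        if element.text:
            parts.append(element.text)
        for child in element:
            # Comment nodes have a non-string tag; skip their content.
            if isinstance(child.tag, str):
                walk(child)
                if child.tag in BLOCK_TAGS:
                    parts.append("\n")
            # Text that follows a child belongs to the parent's content.
            if child.tail:
                parts.append(child.tail)

    walk(root)
    return "".join(parts).strip()


print(fragment_to_text("<p>a<br>b</p><p>c</p>"))
```

For the markup in the question this gives the same single line as the `itertext()` version; the difference only shows up once the input has several block elements.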
maxschlepzig