Questions tagged [html5lib]

html5lib is a library for parsing and serializing HTML documents and fragments in Python, with ports to Dart, PHP, and Ruby.

html5lib is an open-source HTML parser for Python, based on the HTML specification. There are ports for PHP and Ruby (both unmaintained), as well as a third-party one for Dart.
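
As a rough illustration of what the tag description means by parsing "based on the HTML specification", here is a minimal sketch. It assumes html5lib is installed (`pip install html5lib`) and uses its `etree` tree builder. The helper name `parse_document` and the `HAVE_HTML5LIB` flag are illustrative only and not part of the library.

```python
import importlib.util

# html5lib is a third-party package; this flag lets the sketch degrade
# gracefully when it is not installed.
HAVE_HTML5LIB = importlib.util.find_spec("html5lib") is not None


def parse_document(markup):
    """Parse markup as a full HTML document and return the root <html> element.

    Returns None when html5lib is unavailable (hypothetical helper).
    """
    if not HAVE_HTML5LIB:
        return None
    import html5lib

    # namespaceHTMLElements=False keeps tag names plain ("body" rather
    # than "{http://www.w3.org/1999/xhtml}body").
    return html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)


root = parse_document("<p>Hello")
if root is not None:
    # The parser supplies the html, head and body elements that the
    # input left out, much as a browser would.
    print([child.tag for child in root])
    print(root.find("body/p").text)
```

This behavior of filling in missing structure is what several of the questions listed below are about.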

107 questions
67
votes
8 answers

beautifulsoup, html5lib: module object has no attribute _base

After updating my packages, I get this new error: class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder): AttributeError: 'module' object has no attribute '_base' I tried updating beautifulsoup, but it did not help. How can I fix…
Ehvince
  • 17,274
  • 7
  • 58
  • 79
41
votes
9 answers

Don't put html, head and body tags automatically, beautifulsoup

I'm using beautifulsoup with html5lib, and it adds the html, head and body tags automatically: BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html> Is there any option I can set to turn off this behavior…
Bengineer
  • 7,264
  • 7
  • 27
  • 28
22
votes
2 answers

BeautifulSoup - how should I obtain the body contents

I'm parsing HTML with BeautifulSoup. At the end, I would like to obtain the body contents, but without the body tags. But BeautifulSoup adds html, head, and body tags. In this Google Groups discussion one possible solution is proposed: >>> from bs4…
Philipp Zedler
  • 1,660
  • 1
  • 17
  • 36
20
votes
7 answers

How can I parse HTML with html5lib, and query the parsed HTML with XPath?

I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a…
Dan.StackOverflow
  • 1,279
  • 4
  • 18
  • 28
19
votes
3 answers

Error in reading html to data frame in Python “html5lib not found”

I've come across the following error about html5lib when trying to read an html data frame. Here is the code: !pip install html5lib !pip install lxml !pip install beautifulSoup4 import html5lib import lxml from bs4 import BeautifulSoup table_list…
J. Serra
  • 440
  • 1
  • 4
  • 13
9
votes
1 answer

Convert lxml _Element to HtmlElement

For various reasons I'm trying to switch from lxml.html.fromstring() to lxml.html.html5parser.document_fromstring(). The big difference between the two is that the first returns an lxml.html.HtmlElement, and the second returns an…
mlissner
  • 17,359
  • 18
  • 106
  • 169
9
votes
1 answer

difference between lxml and html5lib in the context of beautifulsoup

Is there a difference between the capabilities of the lxml and html5lib parsers in the context of beautifulsoup? I am trying to learn to use BS4 and am using the following code construct -- ret = requests.get('http://www.olivegarden.com') soup =…
R11
  • 405
  • 2
  • 6
  • 15
8
votes
3 answers

AttributeError: module 'html5lib.treebuilders.etree' has no attribute 'getETreeModule'

Suggestions please, thanks :) pip list --outdated --format=freeze Gives the following error: ERROR: Exception: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/pip/_internal/cli/base_command.py", line 223, in _main …
dewijones92
  • 1,319
  • 2
  • 24
  • 45
8
votes
3 answers

Obtaining position info when parsing HTML in Python

I'm trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions is met, output that piece of the document with the position (line, column). The position information is what is tripping me up here. And to be…
Waylan
  • 37,164
  • 12
  • 83
  • 109
7
votes
3 answers

Use html5lib to convert an HTML fragment to plain text

Is there an easy way to use the Python library html5lib to convert something like this:

Hello World. Greetings from Mars.

to Hello World. Greetings from Mars.
Jason Christa
  • 12,150
  • 14
  • 58
  • 85
7
votes
2 answers

BeautifulSoup - lxml and html5lib parsers scraping differences

I am using BeautifulSoup 4 with Python 2.7. I would like to extract certain elements from a website (Quantities, see the example below). For some reason, the lxml parser doesn't allow me to extract all of the desired elements from the page. It…
LaGuille
  • 1,658
  • 5
  • 20
  • 37
5
votes
2 answers

Remove contents of tags using html5lib or bleach

I've been using the excellent bleach library for removing bad HTML. I've got a load of HTML documents which have been pasted in from Microsoft Word, and contain things like: Using bleach (with the…
Dominic Rodger
  • 97,747
  • 36
  • 197
  • 212
5
votes
2 answers

transport_encoding error during installing with pip

I'm getting unexpected arg: keyword encoding in parse() while trying to install any Python package through pip. I've been getting this problem since I installed tensorflow for Python 3.6, which probably led to some issue with html5lib and setuptools.…
Itachi
  • 2,817
  • 27
  • 35
5
votes
1 answer

html5lib installed but BeautifulSoup cannot find it

I have installed the html5lib package. I'm sure because when I try to install it, I get a message that it is already installed. pip install html5lib Requirement already satisfied: html5lib in ./anaconda/lib/python3.5/site-packages Also I am able to…
Parikshit Bhinde
  • 475
  • 8
  • 15
5
votes
1 answer

ImportError: No module named base in html5lib

I suddenly can't start my Django server any more. Running the check python manage.py check shows the following error: apps.populate(settings.INSTALLED_APPS) File…
Aymen Gasmi
  • 474
  • 4
  • 13