Questions tagged [html5lib]

html5lib is an open-source Python library for parsing and serializing HTML documents and fragments, based on the HTML specification. There are ports for PHP and Ruby (both unmaintained), as well as a third-party port for Dart.
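
As a minimal sketch of what the tag covers, the snippet below parses the same markup once as a full document and once as a fragment. It assumes html5lib is installed and uses its etree tree builder; the variable names are only illustrative.

```python
import html5lib

# Parse as a complete document: the parser adds the implied
# html, head, and body elements around the heading.
document = html5lib.parse(
    "<h1>FOO</h1>", treebuilder="etree", namespaceHTMLElements=False
)
heading = document.find("body/h1")

# Parse as a fragment: no html, head, or body wrapper is created,
# so the heading is a direct child of the fragment root.
fragment = html5lib.parseFragment(
    "<h1>FOO</h1>", treebuilder="etree", namespaceHTMLElements=False
)
fragment_tags = [child.tag for child in fragment]

print(document.tag, heading.text, fragment_tags)
```

Passing namespaceHTMLElements=False keeps tag names free of the XHTML namespace prefix, which makes simple path lookups like the one above easier to read.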

107 questions

8 answers

beautifulsoup, html5lib: module object has no attribute _base

After updating my packages, I get this new error: class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder): AttributeError: 'module' object has no attribute '_base' I tried updating beautifulsoup, but it made no difference. How can I fix…

beautifulsoup html5lib

asked Jul 19 '16 at 00:14

Ehvince

9 answers

Don't put html, head and body tags automatically, beautifulsoup

I'm using beautifulsoup with html5lib, and it adds the html, head and body tags automatically: BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html> Is there any option I can set to turn off this behavior…

python beautifulsoup html5lib

asked Feb 11 '13 at 22:33

Bengineer

2 answers

BeautifulSoup - how should I obtain the body contents

I'm parsing HTML with BeautifulSoup. At the end, I would like to obtain the body contents, but without the body tags. But BeautifulSoup adds html, head, and body tags. In this Google Groups discussion, one possible solution is proposed: >>> from bs4…

python django beautifulsoup html5lib

asked Jan 30 '14 at 09:44

Philipp Zedler

7 answers

How can I parse HTML with html5lib, and query the parsed HTML with XPath?

I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a…

python parsing xpath lxml html5lib

asked Apr 01 '10 at 04:04

Dan.StackOverflow

3 answers

Error in reading html to data frame in Python “html5lib not found”

I've come across the following error about html5lib when trying to read an HTML table into a data frame. Here is the code: !pip install html5lib !pip install lxml !pip install beautifulSoup4 import html5lib import lxml from bs4 import BeautifulSoup table_list…

python-2.7 pandas dataframe html5lib

asked Mar 01 '18 at 03:36

J. Serra

1 answer

Convert lxml _Element to HtmlElement

For various reasons I'm trying to switch from lxml.html.fromstring() to lxml.html.html5parser.document_fromstring(). The big difference between the two is that the first returns an lxml.html.HtmlElement, and the second returns an…

lxml html5lib

asked Oct 14 '15 at 20:04

mlissner

1 answer

difference between lxml and html5lib in the context of beautifulsoup

Is there a difference between the capabilities of the lxml and html5lib parsers in the context of beautifulsoup? I am trying to learn to use BS4 and am using the following code construct -- ret = requests.get('http://www.olivegarden.com') soup =…

python beautifulsoup lxml html5lib

asked Sep 03 '13 at 00:44

R11

3 answers

AttributeError: module 'html5lib.treebuilders.etree' has no attribute 'getETreeModule'

Suggestions please, thanks :) pip list --outdated --format=freeze Gives the following error: ERROR: Exception: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/pip/_internal/cli/base_command.py", line 223, in _main …

python pip html5lib

asked Sep 30 '21 at 10:13

dewijones92

3 answers

Obtaining position info when parsing HTML in Python

I'm trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions is met, output that piece of the document with its position (line, column). The position information is what is tripping me up here. And to be…

python html parsing lxml html5lib

asked Feb 25 '15 at 20:01

Waylan

3 answers

Use html5lib to convert an HTML fragment to plain text

Is there an easy way to use the Python library html5lib to convert something like this:

Hello World. Greetings from Mars.

to Hello World. Greetings from Mars.

python html html5lib

asked Dec 31 '11 at 00:19

Jason Christa

2 answers

BeautifulSoup - lxml and html5lib parsers scraping differences

I am using BeautifulSoup 4 with Python 2.7. I would like to extract certain elements from a website (Quantities, see the example below). For some reason, the lxml parser doesn't allow me to extract all of the desired elements from the page. It…

python web-scraping beautifulsoup lxml html5lib

asked Mar 27 '14 at 19:08

LaGuille

2 answers

Remove contents of tags using html5lib or bleach

I've been using the excellent bleach library for removing bad HTML. I've got a load of HTML documents which have been pasted in from Microsoft Word, and contain things like: Using bleach (with the…

python django html5lib

asked Sep 24 '11 at 11:00

Dominic Rodger

2 answers

transport_encoding error during installing with pip

I'm getting unexpected arg: keyword encoding in parse() while trying to install any Python package through pip. I've had this problem since I installed tensorflow for Python 3.6, which probably led to some issue with html5lib and setuptools.…

python-3.x pip setuptools html5lib

asked Oct 02 '17 at 15:28

Itachi

1 answer

html5lib installed but BeautifulSoup cannot find it

I have installed the html5lib package. I'm sure because when I try to install it, I get a message that it is already installed. pip install html5lib Requirement already satisfied: html5lib in ./anaconda/lib/python3.5/site-packages Also, I am able to…

python beautifulsoup html5lib

asked Sep 20 '17 at 16:40

Parikshit Bhinde

1 answer

ImportError: No module named base in html5lib

I suddenly can't start my Django server any more. Running the check python manage.py check shows the following error: apps.populate(settings.INSTALLED_APPS) File…

python django importerror requirements.txt html5lib

asked Mar 10 '17 at 10:14

Aymen Gasmi
