Questions tagged [lxml.html]

lxml.html is a dedicated Python package for dealing with HTML. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.

159 questions
21
votes
1 answer

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?

I want to preserve <br> tags as \n when extracting the text content from lxml elements. Example code: fragment = 'This is a text node. This is another text node. And a child element.Another child, with two…
extempo
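
One possible approach to the question above, offered as a minimal sketch rather than the accepted answer: prefix the tail text of every br element with a newline, then call text_content(). The helper name text_with_newlines and the sample fragment are illustrative assumptions, not taken from the question.

```python
import lxml.html

def text_with_newlines(fragment):
    """Return the text of an HTML fragment with each <br> rendered as a newline."""
    root = lxml.html.fromstring(fragment)
    for br in root.iter("br"):
        # Text that follows a <br> is stored in its .tail; prefix it with "\n".
        br.tail = "\n" + (br.tail or "")
    return root.text_content()

# Hypothetical sample input, not the fragment from the question.
sample = "<div>First line.<br>Second line.</div>"
result = text_with_newlines(sample)
```

Note that this mutates the parsed tree, so parse a fresh copy if the original markup is needed afterwards.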
18
votes
2 answers

Extending CSS selectors in BeautifulSoup

The Question: BeautifulSoup provides very limited support for CSS selectors. For instance, the only supported pseudo-class is nth-of-type, and it accepts only numerical values; arguments like even or odd are not allowed. Is it possible to…
alecxe
14
votes
3 answers

Type hints for lxml?

I'm new to Python and come from a statically typed language background. I want type hints for https://lxml.de just for ease of development (mypy flagging issues and suggesting methods would be nice!). To my knowledge, this is a Python 2.0 module and…
Ian
14
votes
4 answers

How to use Cleaner, lxml.html without returning div tag?

I have this code: evil = "bold textitalic text" cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'], page_structure=True) print cleaner.clean_html(evil) I expected…
Allan Veloso
13
votes
1 answer

How to preserve inline CSS style with lxml.html.clean.Cleaner() in Python?

I am trying to clean up an HTML table using lxml.html.clean.Cleaner(). I need to strip JavaScript attributes, but would like to preserve inline CSS style. I thought style=False is the default setup: import lxml.html.clean cleaner =…
laviex
9
votes
1 answer

Python Print element from lxml html

Trying to print out the entire element retrieved from lxml. from lxml import html import requests page=requests.get("http://finance.yahoo.com/q?s=INTC") qtree = html.fromstring(page.content) quote =…
Kevin
9
votes
2 answers

How to fix issue with the removed cssselect package in lxml?

So they removed the cssselect package from lxml. Now my Python program is useless. I just can't figure out how I could get it working: ImportError: cssselect seems not to be installed. See http://packages.python.org/cssselect/ I've tried to copy…
kamilla
8
votes
1 answer

Python Xpath: lxml.etree.XPathEvalError: Invalid predicate

I'm trying to learn how to scrape web pages and in the tutorial I'm using the code below is throwing this error: lxml.etree.XPathEvalError: Invalid predicate The website I'm querying is (don't judge me, it was the one used in the training vid :/ ):…
Michael Martinez
7
votes
2 answers

Why am I getting this ImportError?

I have a tkinter app that I am compiling to an .exe via py2exe. In the setup file, I have set it to include lxml, urllib, lxml.html, ast, and math. When I run python setup.py py2exe in a CMD console, it compiles fine. I then go to the dist folder It…
Zach Gates
6
votes
1 answer

How to rename a node with Python LXML?

How do I rename a node using LXML? Specifically, how to rename a parent node i.e. a tag while preserving all the underlying structure? I am parsing using the lxml.html module but supposedly there shouldn't be any difference between xml and…
ccpizza
6
votes
1 answer

printing html entities using lxml in python

I'm trying to make a div element from the below string with HTML entities. Since my string contains HTML entities, the reserved & character in each entity is being escaped as &amp; in the output. Thus the HTML entities are displayed as plain text. How can I…
ravi
5
votes
1 answer

lxml.html. Error reading file; Failed to load external entity

I am trying to get a movie trailer url from YouTube using parsing with lxml.html: from lxml import html import lxml.html from lxml.etree import XPath def get_youtube_trailer(selected_movie): # Create the url for the YouTube query in order to find…
alekscp
5
votes
1 answer

href attribute for lxml.html

according to this answer: >>> from lxml.html import fromstring >>> s = """""" >>> doc = fromstring(s) >>> doc.value '1234' >>> doc.name 'question' I tried to get both the link and the text from this…
nazmus saif
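
As a hedged illustration of the question above: unlike the value and name shortcuts on form inputs, a link's target is usually read with the generic get() method, and its visible text with text_content(). The anchor markup and URL below are made up for the example.

```python
from lxml.html import fromstring

# Hypothetical link; the question's own markup is not shown.
doc = fromstring('<a href="https://example.com/page">Example link</a>')

href = doc.get("href")       # attribute lookup; returns None if missing
text = doc.text_content()    # text of the element and its descendants
print(href, text)
```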
5
votes
2 answers

How to remove insignificant whitespace in lxml.html?

I'm rather surprised that lxml.html leaves insignificant whitespace when parsing HTML by default. I'm also surprised that I can't find any obvious way to make it not do that. Python 2.7.3 (default, Apr 10 2013, 06:20:15) [GCC 4.6.3] on linux2 Type…
Mark E. Haase
5
votes
1 answer

parse html body fragment in lxml

I'm trying to parse a fragment of html:

title

I use lxml.html.fromstring. And it is driving me insane because it keeps stripping the tag of my fragments: >…
fserb