
I want to extract all the URLs and their names (the link text) from an HTML page using lxml.

I can parse the page and find them manually, but is there an easier way to find all the links using lxml?

sam
    Note that HTML is not XML; if you have trouble with parsing because of missing end elements or missing quotes around attribute values, [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) can help or might be better suited. – Aaron Digulla Apr 30 '12 at 12:20

2 Answers

from lxml.html import parse

# Fetch the page, parse it, and take the document root element
dom = parse('http://www.google.com/').getroot()
# cssselect('a') returns every <a> element in the document
links = dom.cssselect('a')
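The snippet above returns the `<a>` elements themselves; to get each URL together with its name (the link text), you can iterate over the anchors and read the `href` attribute and text. A sketch using a hypothetical inline document, so it runs without network access (plain XPath is used here, which needs no extra package):

```python
from lxml import html

# Hypothetical page content standing in for a fetched document
doc = html.fromstring(
    '<div>'
    '<a href="http://example.com/a">First</a>'
    '<a href="http://example.com/b">Second</a>'
    '</div>'
)

# Pair each anchor's href attribute with its visible text
pairs = [(a.get('href'), a.text_content()) for a in doc.xpath('//a')]
print(pairs)
```

The same loop works unchanged on the `links` list from the answer above.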
kev
from lxml import cssselect, html

# Read the HTML file from disk
with open("/your/path/index.html", "r") as f:
    fileread = f.read()

dochtml = html.fromstring(fileread)

# Build the CSS selector once, then apply it to the parsed tree
select = cssselect.CSSSelector("a")
links = [el.get('href') for el in select(dochtml)]

for n, l in enumerate(links):
    print(n, l)
lmokto
    Note that cssselect is now a standalone project and doesn't come with lxml anymore. Install with `pip install cssselect`. Go [here](https://pythonhosted.org/cssselect/) for more information. – jheyse Sep 05 '14 at 23:17
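As the comment notes, cssselect is now a separate package. If you would rather stay within lxml itself, `lxml.html` documents have an `iterlinks()` method that needs no extra install. A sketch on a hypothetical sample document:

```python
from lxml import html

# Hypothetical sample markup with two kinds of links
doc = html.fromstring('<p><a href="/x">X</a> and <img src="/pic.png"></p>')

# iterlinks() yields (element, attribute, link, pos) for every link-like
# attribute (href, src, ...), not only <a> tags
for element, attribute, link, pos in doc.iterlinks():
    print(element.tag, attribute, link)
```

Note that `iterlinks()` is broader than a CSS selector on `a`: it also reports `src` attributes and similar, so filter on `element.tag == 'a'` if you want anchors only.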