I am writing a script to read a web page and build a database of links that match certain criteria. Right now I am stuck with lxml and understanding how to grab all the `<a href>` links from the HTML...

result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')
halfer
Cmag
    this has been asked dozens of times and has good answers, e.g.: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – Benjamin Wohlwend May 25 '11 at 21:29

4 Answers


Use XPath. Something like (can't test from here):

urls = html.xpath('//a/@href')
Fred Foo
  • Use `html.xpath('//a')` instead and then (off the top of my head) `.attrib['href']` for the url and `.text` for the contents. – Fred Foo May 25 '11 at 21:53
  • where can i read more? how do you know the fields for xpath? reading http://www.w3schools.com/xpath/xpath_syntax.asp – Cmag May 25 '11 at 21:58
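
Putting the answer and the follow-up comment together, here is a minimal runnable sketch. The inline HTML string and example.com URLs are stand-ins for the page the question fetches with `self._openurl(self.mainurl)`:

```python
import lxml.html

# Stand-in for the content read from the fetched page
content = """
<html><body>
  <a href="http://example.com/a">first link</a>
  <a href="http://example.com/b">second link</a>
  <p>no link here</p>
</body></html>
"""

html = lxml.html.fromstring(content)

# The XPath from the answer: '//a/@href' selects the href attribute
# of every <a> element anywhere in the document
urls = html.xpath('//a/@href')
print(urls)  # ['http://example.com/a', 'http://example.com/b']

# The alternative from the comment: select the <a> elements themselves,
# then read the attribute and text from each element
for a in html.xpath('//a'):
    print(a.get('href'), a.text)
```

Selecting the elements (rather than the attributes directly) is useful when the database also needs the link text or other attributes alongside each URL.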