I am writing a script to read a web page and build a database of links that match certain criteria. Right now I am stuck with lxml and understanding how to grab all the `<a href>` links from the HTML...

result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')
halfer
Cmag
    this has been asked dozens of times and has good answers, e.g.: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – Benjamin Wohlwend May 25 '11 at 21:29

4 Answers


Use XPath. Something like (can't test from here):

urls = html.xpath('//a/@href')
Fred Foo
  • Use `html.xpath('//a')` instead and then (off the top of my head) `.attrib['href']` for the url and `.text` for the contents. – Fred Foo May 25 '11 at 21:53
  • where can i read more? how do you know the fields for xpath? reading http://www.w3schools.com/xpath/xpath_syntax.asp – Cmag May 25 '11 at 21:58
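
Putting the answer and the follow-up comment together, here is a minimal runnable sketch. The inline HTML string and example.com URLs are stand-ins for the page the question fetches with `self._openurl(self.mainurl)`:

```python
import lxml.html

# Stand-in for the content read from the fetched page
content = """
<html><body>
  <a href="http://example.com/a">first link</a>
  <a href="http://example.com/b">second link</a>
  <p>no link here</p>
</body></html>
"""

html = lxml.html.fromstring(content)

# The XPath from the answer: '//a/@href' selects the href attribute
# of every <a> element anywhere in the document
urls = html.xpath('//a/@href')
print(urls)  # ['http://example.com/a', 'http://example.com/b']

# The alternative from the comment: select the <a> elements themselves,
# then read the attribute and text from each element
for a in html.xpath('//a'):
    print(a.get('href'), a.text)
```

Selecting the elements (rather than the attributes directly) is useful when the database also needs the link text or other attributes alongside each URL.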