How to use Beautiful Soup to extract link under tree structure

Question

Suppose the HTML page looks like this:

<html>
    <div id="a">
        <div class="aa">
            <p>
                <a id="ff" href="#">ff</a>
                <a id="gg" href="#">gg</a>
            </p>
        </div>
        <div class="bb">
            <p>
                <a id="ff" href="#">ff</a>
            </p>
        </div>
    </div>
    <div id="b">
    </div>
</html>

After using

soup = BeautifulSoup(webpage.read())

I have the parsed HTML page, and I would like to get the links that are under the tree structure: <html> -> <div id="a"> -> <div class="aa">.

How can I write the Python code for this using Beautiful Soup?

It would be useful if you mentioned what you've tried so far. — Shawn Chin, Aug 28 '12 at 08:59

score 3 · Answer 1 · edited May 23 '17 at 11:58

Without more info about your data it is difficult to give you a concise solution that will cover all possible inputs. To help you on your way, here's a walkthrough which will hopefully lead you to a solution that suits your needs.

The following will give us <div id="a"> (there should only be one element with a specific id):

top_div = soup.find('div', {'id':'a'})

We can then proceed to retrieve all inner divs with class='aa' (possible to have more than one):

aa_div = top_div.findAll('div', {'class':'aa'})

From there, we can return all links for each div found:

links = [div.findAll('a') for div in aa_div]

Note that links contains a nested list since div.findAll('a') will return a list of a nodes found. There are various ways to flatten such a list.

Here's an example which iterates through the list and prints out the individual links:

>>> from itertools import chain
>>> for a in chain.from_iterable(links):
...   print(a)
... 
<a id="ff" href="#">ff</a>
<a id="gg" href="#">gg</a>
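
For reference, here is the walkthrough above put together as one script. This is only a sketch under a few assumptions: it uses Beautiful Soup 4 (the `bs4` package) with the built-in `html.parser`, the bs4 spelling `find_all` in place of `findAll`, and the sample page from the question inlined as a string (`HTML` is just a name chosen for this example).

```python
from itertools import chain

from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Sample page from the question, inlined so the script needs no network access.
HTML = """
<html>
    <div id="a">
        <div class="aa">
            <p>
                <a id="ff" href="#">ff</a>
                <a id="gg" href="#">gg</a>
            </p>
        </div>
        <div class="bb">
            <p>
                <a id="ff" href="#">ff</a>
            </p>
        </div>
    </div>
    <div id="b">
    </div>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Step 1: the single div with id="a".
top_div = soup.find("div", {"id": "a"})

# Step 2: every div with class "aa" inside it (there may be several).
aa_divs = top_div.find_all("div", {"class": "aa"})

# Step 3: one list of <a> tags per div, then flatten into a single list.
nested = [div.find_all("a") for div in aa_divs]
links = list(chain.from_iterable(nested))

# The link targets are the href attributes of those tags.
hrefs = [a.get("href") for a in links]
print(hrefs)
```

If `soup.find` matches nothing it returns `None`, so real input probably wants a check on `top_div` before step 2.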

The solution presented above is rather long-winded. With a better understanding of the input data, a much more compact solution is possible. For example, if the data is exactly as you've shown and there will always be exactly one div with class='aa', then the solution could simply be:

>>> soup.find('div', {'class':'aa'}).findAll('a')
[<a id="ff" href="#">ff</a>, <a id="gg" href="#">gg</a>]

Using CSS selectors with BeautifulSoup4

If you're using a newer version of BeautifulSoup (version 4), you could also use the .select() method, which provides CSS selector support. The elaborate solution I provided at the beginning of this answer could be rewritten as:

soup.select("div#a div.aa a")
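
As a rough illustration of what `.select()` gives back, the sketch below runs that selector against a condensed copy of the sample page and reads the `href` and text from each result. It assumes Beautiful Soup 4 with the built-in `html.parser`; the names `HTML`, `anchors`, `strict_anchors` and `pairs` are made up for this example. The second selector uses the CSS child combinator `>`, which may be closer to the "tree structure" wording in the question because it requires `div.aa` to be a direct child of `div#a`.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Condensed copy of the sample page from the question.
HTML = (
    '<html><div id="a">'
    '<div class="aa"><p><a id="ff" href="#">ff</a><a id="gg" href="#">gg</a></p></div>'
    '<div class="bb"><p><a id="ff" href="#">ff</a></p></div>'
    '</div><div id="b"></div></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

# Descendant selector: any <a> inside a div.aa that is somewhere inside div#a.
anchors = soup.select("div#a div.aa a")

# Child combinator: div.aa must be a direct child of div#a.
strict_anchors = soup.select("div#a > div.aa a")

# select() returns a list of Tag objects, so attributes and text are read as usual.
pairs = [(a["href"], a.get_text()) for a in anchors]
print(pairs)
```

On the sample page both selectors match the same two links; they would differ only if a `div.aa` were nested more deeply under `div#a`.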

For BeautifulSoup v3, you can add on this functionality using soupselect.

However, do note the following statement from the docs (emphasis mine):

This is a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly, because it’s faster. But this lets you combine simple CSS selectors with the Beautiful Soup API.

@Kos I'm under the impression that you'd need something like [soupselect](http://code.google.com/p/soupselect/). Have CSS-style selectors been added to BeautifulSoup? Would be awesome if they have. — Shawn Chin, Aug 28 '12 at 11:11
Yup, at least partially. http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors The syntax is `soup.select(...)` — Kos, Aug 28 '12 at 11:17
Thanks. I've updated the answer with a quick mention of `.select()` (BS4 only I believe, introduced in [this revision](http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/revision/180)). — Shawn Chin, Aug 28 '12 at 12:20

That1Guy · Answer 2 · 2012-08-30T15:02:09.277

I would go about it this way:

from BeautifulSoup import BeautifulSoup
import urllib

url = 'http://www.website.com'
file_pointer = urllib.urlopen(url)
html_object = BeautifulSoup(file_pointer)

link_list = []
links = html_object('div',{'class':'aa'})[0]('a')
for href in links:
    link_list.append(href['href'])

This builds a list of link targets (the href values) that can be accessed by index:

link_1 = link_list[0]
link_2 = link_list[1]

Alternatively, if you want the text associated with the links (i.e. 'Click Here' rather than '/Product/Store/Whatever.html'), you could change this same code very slightly and produce the desired results:

link_list = []
links = html_object('div',{'class':'aa'})[0]('a')
for text in links:
    link_list.append(text.contents[0])

Again, this builds a list, so you access the items by index:

link_1_text = link_list[0]
link_2_text = link_list[1]
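
The code above targets Python 2 and Beautiful Soup 3. For readers on Python 3 with Beautiful Soup 4, a rough equivalent might look like the sketch below. It is an adaptation, not the original author's code: the page is inlined as a string instead of being fetched (in Python 3 the fetch would go through `urllib.request.urlopen`), the parser is named explicitly, and `text_list` is a name introduced here so that the hrefs and the link texts do not share one list.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Stand-in for the downloaded page; in Python 3 the download itself would use
# urllib.request.urlopen(url) rather than urllib.urlopen(url).
html = (
    '<html><div id="a">'
    '<div class="aa"><p><a id="ff" href="#">ff</a><a id="gg" href="#">gg</a></p></div>'
    '<div class="bb"><p><a id="ff" href="#">ff</a></p></div>'
    '</div><div id="b"></div></html>'
)

html_object = BeautifulSoup(html, "html.parser")

# First div with class "aa", then every <a> inside it.
# (Calling a tag, as in html_object('div', ...), is shorthand for find_all.)
links = html_object.find_all("div", {"class": "aa"})[0].find_all("a")

# Link targets and link texts, kept in separate lists.
link_list = [a["href"] for a in links]
text_list = [a.get_text() for a in links]

print(link_list)
print(text_list)
```

As in the original, indexing `[0]` raises an `IndexError` when no div with class `aa` exists, so a length check may be worth adding for real pages.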

score 2 · Answer 3 · answered Dec 26 '14 at 14:21

I found this in the official Beautiful Soup documentation:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
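
Note that `soup.find_all('a')` walks the whole document, so on the page from the question it would also pick up the link under `<div class="bb">`. A small sketch of the difference, assuming Beautiful Soup 4 with the built-in `html.parser` and a condensed copy of the sample page (`HTML`, `all_ids` and `scoped_ids` are names invented for this example):

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Condensed copy of the sample page from the question.
HTML = (
    '<html><div id="a">'
    '<div class="aa"><p><a id="ff" href="#">ff</a><a id="gg" href="#">gg</a></p></div>'
    '<div class="bb"><p><a id="ff" href="#">ff</a></p></div>'
    '</div><div id="b"></div></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

# Every link on the page, regardless of where it sits in the tree.
all_ids = [link.get("id") for link in soup.find_all("a")]

# Only the links under <div id="a"> -> <div class="aa">.
scope = soup.find("div", {"id": "a"}).find("div", {"class": "aa"})
scoped_ids = [link.get("id") for link in scope.find_all("a")]

print(all_ids)
print(scoped_ids)
```

Calling `find_all` on a tag rather than on the soup is what restricts the search to that subtree.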

You can read more about Beautiful Soup here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Regards
