How to get the link inside the li tag?

Question

I have this code:

import urllib
from bs4 import BeautifulSoup
url = "http://download.cnet.com/windows/"
pageHtml = urllib.urlopen(url)
soup = BeautifulSoup(pageHtml)
for a in soup.select("div.catFlyout a[href]"):
    print "http://download.cnet.com"+a["href"]

But this code does not give the correct output. The expected output looks like this:

http://download.cnet.com/windows/security-software/
http://download.cnet.com/windows/browsers/
http://download.cnet.com/windows/business-software/
..
..
http://download.cnet.com/windows/video-software/

score 1 · Answer 1 · edited May 23 '17 at 12:11

The list contains both relative and absolute links. Prepend the base URL only if the link does not start with http:

for a in soup.select("div.catFlyout a[href]"):
    if not a["href"].startswith("http"):
        print "http://download.cnet.com"+a["href"]
    else:
        print a["href"]

Or use urlparse to check whether the link is absolute:

import urllib
import urlparse
from bs4 import BeautifulSoup

def is_absolute(url):
    return bool(urlparse.urlparse(url).scheme)

url = "http://download.cnet.com/windows/"
pageHtml = urllib.urlopen(url)
soup = BeautifulSoup(pageHtml)
for a in soup.select("div.catFlyout a[href]"):
    if not is_absolute(a['href']):
        print "http://download.cnet.com"+a["href"]
    else:
        print a["href"]
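
As a possible alternative to the explicit check, here is a minimal sketch assuming Python 3 (the code above is Python 2). `urllib.parse.urljoin` resolves a relative href against a base URL and returns an absolute href unchanged, so both cases can go through one call. The `absolutize` helper and the sample hrefs are hypothetical, standing in for the `a["href"]` values returned by `soup.select(...)`:

```python
from urllib.parse import urljoin

BASE_URL = "http://download.cnet.com/windows/"


def absolutize(hrefs, base=BASE_URL):
    """Resolve each href against base; absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


# Hypothetical hrefs for illustration, not scraped from the real page.
sample_hrefs = [
    "/windows/security-software/",                 # relative to the site root
    "http://download.cnet.com/windows/browsers/",  # already absolute
]

for link in absolutize(sample_hrefs):
    print(link)
```

In the scraping loop this would amount to printing `urljoin(url, a["href"])` for each matched tag, with no branch on whether the link is absolute.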

answered Sep 09 '13 at 09:53

alecxe

How can I get only the category links, not all the sub-links inside each category? @alecxe – wan mohd payed Sep 10 '13 at 02:05
@wanmohdpayed what do you mean by category links? Links that don't end with `html`? – alecxe Sep 10 '13 at 09:35
I mean I want only the links directly under the category. I don't want the sub-links inside those links. @alecxe – wan mohd payed Sep 11 '13 at 02:24
