
I'm trying to build a web crawler using the requests module, and basically what I want it to do is go to a webpage, get all the hrefs, and then write them to a text file.

So far my code looks like this:

import requests
from bs4 import BeautifulSoup

def getLinks(url):
    response = requests.get(url).text
    soup = BeautifulSoup(response, "html.parser")
    for link in soup.findAll("a"):
        print("Link:" + str(link.get("href")))

which works on some sites, but on the one I'm trying to use it on, the hrefs aren't full domain names like "www.google.com"; instead they're like... paths to a directory that redirects to the link?

looks like this:

href="/out/101"

and if I try to write that into a file it looks like this:

 1. /out/101
 2. /out/102
 3. /out/103
 4. /out/104

which isn't really what I wanted.

so how do I go about getting the domain names from these links?

– Iron Fist

2 Answers


This means that the URLs are relative to the current page. To get the full URLs, use urljoin():

from urllib.parse import urljoin  # Python 3; on Python 2 it was: from urlparse import urljoin

for link in soup.findAll("a"): 
    full_url = urljoin(url, link.get("href"))
    print("Link:" + full_url)
– alecxe
  • ah right, but that only gives the full url of the page that redirects to the actual site, but how do I get the url of the site that I get redirected to? :P – stav Jan 04 '16 at 21:08
  • @stav make a request to it and get the `response.url`. If you would need to record the redirect chain, see http://stackoverflow.com/questions/20475552/python-requests-library-redirect-new-url. – alecxe Jan 04 '16 at 21:12
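
For reference, here is a minimal sketch that combines both points: urljoin() to absolutize the hrefs, then a follow-up request per link so that response.url gives the redirect target. It assumes Python 3 with requests and BeautifulSoup installed; the function name getRedirectTargets, the links.txt filename, and the example.com URL are placeholders, not part of the answer above:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def getRedirectTargets(url):
    # Fetch the page and build absolute URLs from every href.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    full_urls = [urljoin(url, a.get("href"))
                 for a in soup.findAll("a") if a.get("href")]

    # requests follows redirects by default, so response.url is the
    # final destination (e.g. wherever /out/101 ends up).
    targets = []
    for full_url in full_urls:
        try:
            response = requests.get(full_url, timeout=10)
            targets.append(response.url)
        except requests.RequestException:
            pass  # skip links that fail to resolve
    return targets

# Write the resolved links to a text file, as the question asks.
with open("links.txt", "w") as f:
    for target in getRedirectTargets("http://example.com"):
        f.write(target + "\n")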

Try the code below. It will give you all the links from a website. If you know the base URL of the website, then you can extract all the other URLs from it. The whole web scraping code is here: WebScrape

import requests
import lxml.html

def extractLinks(url, base):
    '''
    Return a list of links found on the page at `url`.
    :param url: the page to fetch
    :param base: the base URL used to absolutize relative links
    :return: list of absolute links
    '''
    links = []  # will contain all the links from the page
    try:
        r = requests.get(url)
    except requests.RequestException:
        return []
    obj = lxml.html.fromstring(r.text)
    potential_links = obj.xpath("//a/@href")
    links.append(r.url)  # the page's own (final) URL
    for link in potential_links:
        if base in link:
            # already an absolute link on this site
            links.append(link)
        elif link.startswith("http"):
            # absolute link to another site
            links.append(link)
        elif base.endswith("/"):
            # relative link: join it onto the base
            links.append(base + link.lstrip("/"))
    return links

extractLinks('http://data-interview.enigmalabs.org/companies/',
             'http://data-interview.enigmalabs.org/')
– python
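
To tie this back to the question, the returned list can be written to a text file in the usual way (a short sketch; links.txt is an arbitrary filename):

links = extractLinks('http://data-interview.enigmalabs.org/companies/',
                     'http://data-interview.enigmalabs.org/')
with open("links.txt", "w") as f:
    for link in links:
        f.write(link + "\n")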