How can I retrieve the links of a webpage and copy the URL addresses of those links using Python?
-
Here's an updated code snippet that does exactly what you're asking for in 30 lines. https://github.com/mujeebishaque/extract-urls – Mujeeb Ishaque Apr 25 '21 at 13:10
-
I tried this for a link and got outputs like `/info-service/downloads/#unserekataloge`. Is it not possible to get the full accessible link, and not just part of the sub-link? I want to get links to all PDFs available on the website @MujeebIshaque – x89 Jul 01 '21 at 17:36
16 Answers
Here's a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory- and speed-wise) if you know what you're parsing in advance.

-
+1, using the soup strainer is a great idea because it allows you to circumvent a lot of unnecessary parsing when all you're after are the links. – Evan Fosmark Jul 03 '09 at 18:57
-
I edited to add a similar explanation before I saw Evan's comment. Thanks for noting that, though! – ars Jul 03 '09 at 19:01
-
thanks, this solved my problem; with this I finished my project, thanks a lot – NepUS Jul 03 '09 at 21:17
-
Heads up: `/usr/local/lib/python2.7/site-packages/bs4/__init__.py:128: UserWarning: The "parseOnlyThese" argument to the BeautifulSoup constructor has been renamed to "parse_only."` – BenDundee Feb 19 '13 at 14:11
-
On version 3.2.1 of BeautifulSoup there is no `has_attr`. Instead I see there is something called `has_key` and it works. – Oct 26 '13 at 21:01
-
@NeoVe you could just use `hasattr`, a Python builtin: `hasattr(link, "href")` – cat Mar 25 '16 at 02:50
-
`from bs4 import BeautifulSoup` (not `from BeautifulSoup import BeautifulSoup`), correction needed. – Rishabh Agrahari May 11 '17 at 08:56
-
Updated code for Python 3 and the latest bs4: https://gist.github.com/PandaWhoCodes/7762fac08c4ed005cec82204d7abd61b – Ashish Cherian Sep 30 '19 at 06:01
For completeness' sake, here is the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:
from bs4 import BeautifulSoup
import urllib.request
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
    print(link['href'])
or the Python 2 version:
from bs4 import BeautifulSoup
import urllib2
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))
for link in soup.find_all('a', href=True):
    print link['href']
and a version using the `requests` library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
for link in soup.find_all('a', href=True):
    print(link['href'])
The `soup.find_all('a', href=True)` call finds all `<a>` elements that have an `href` attribute; elements without the attribute are skipped.
BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the character set found in the HTTP response headers to assist in decoding, but this can be wrong and conflict with the `<meta>` header info found in the HTML itself, which is why the above uses the BeautifulSoup internal class method `EncodingDetector.find_declared_encoding()` to make sure that such embedded encoding hints win over a misconfigured server.
With `requests`, the `response.encoding` attribute defaults to Latin-1 if the response has a `text/*` mimetype, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no `charset` is set in the Content-Type header.

-
Is there something like StrainedSoup for bs4? (I don't need it now but just wondering, if there is you might want to add that) – Antti Haapala -- Слава Україні Feb 02 '17 at 07:07
-
@AnttiHaapala: `SoupStrainer` you mean? It [didn't go anywhere, it is still part of the project](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#soupstrainer). – Martijn Pieters Feb 02 '17 at 07:39
-
Is there a reason this code doesn't pass "features=" to the BeautifulSoup constructor? BeautifulSoup gives me a warning about using a default parser. – MikeB May 12 '20 at 15:06
-
@MikeB: when I wrote this answer BeautifulSoup didn't yet raise a warning if you didn't. – Martijn Pieters May 18 '20 at 17:23
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
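For instance, here is a minimal sketch of that compatibility layer via lxml.html.soupparser, which parses through BeautifulSoup under the hood (the broken HTML string below is just a made-up illustration):
from lxml.html import soupparser  # requires BeautifulSoup to be installed as well
broken_html = '<a href="https://example.com">unclosed link'
root = soupparser.fromstring(broken_html)  # parse leniently via BeautifulSoup
print([a.get('href') for a in root.xpath('//a')])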
lxml.html also supports CSS3 selectors, so this sort of thing is trivial.
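For instance, a minimal sketch of the CSS-selector route (it assumes the optional cssselect package is installed; the HTML string is just a placeholder):
import lxml.html
dom = lxml.html.fromstring('<a href="https://example.com/a">a</a> <a name="no-href">b</a>')
for a in dom.cssselect('a[href]'):  # CSS attribute selector instead of XPath
    print(a.get('href'))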
An example with lxml and xpath would look like this:
import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):  # select the url in href for all a tags (links)
    print link

-
BeautifulSoup 4 will use `lxml` as the default parser if installed. – Martijn Pieters Dec 28 '14 at 12:29
import urllib2
import BeautifulSoup
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
    # use .get() so anchors without an href attribute don't raise a KeyError
    if 'national-park' in a.get('href', ''):
        print 'found a url with national-park in the link'

The following code retrieves all the links available on a webpage using `urllib2` and `BeautifulSoup4`:
import urllib2
from bs4 import BeautifulSoup
url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)
for line in soup.find_all('a'):
    print(line.get('href'))

Links can be within a variety of attributes, so you could pass a list of those attributes to `select`. For example, with `src` and `href` attributes (here I am using the starts-with operator `^` to specify that either of these attribute values starts with http):
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)
`[attr^=value]` represents elements with an attribute name of attr whose value is prefixed (preceded) by value.
There are also the commonly used `$` (ends with) and `*` (contains) operators. For a full syntax list see the link above.
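As a small illustrative sketch of those operators (the HTML string here is made up):
from bs4 import BeautifulSoup as bs
html = '<a href="report.pdf">report</a> <img src="https://cdn.example.com/logo.png">'
soup = bs(html, 'lxml')
print([a['href'] for a in soup.select('a[href$=".pdf"]')])  # href ends with ".pdf"
print([i['src'] for i in soup.select('[src*="example"]')])  # src contains "example"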

Under the hood, BeautifulSoup now uses lxml. Requests, lxml and list comprehensions make a killer combo.
import requests
import lxml.html
dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]
In the list comp, the `if '//' in x and 'nytimes.com' not in x` clause is a simple method to scrub the URL list of the site's 'internal' navigation URLs, etc.

-
If it is a repost, why doesn't the original post include: 1. requests 2. list comp 3. logic to scrub site internal & junk links? Try and compare the results of the two posts, my list comp does a surprisingly good job scrubbing the junk links. – cheekybastard Dec 15 '13 at 23:30
-
The OP did not ask for those features and the part that he did ask for has already been posted and solved using the exact same method as you post. However, I'll remove the downvote as the list comprehension does add value for people that do want those features and you do explicitly mention them in the body of the post. Also, you could use the rep :) – dotancohen Dec 16 '13 at 07:43
This script does what you're looking for, but it also resolves the relative links to absolute links.
import urllib
import lxml.html
import urlparse
def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())

def get_links(url):
    return resolve_links((link for link in get_dom(url).xpath('//a/@href')))

def guess_root(links):
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc

def resolve_links(links):
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link

for link in get_links('http://www.google.com'):
    print link

-
This doesn't do what it's meant to do; if resolve_links() doesn't have a root, then it never returns any URLs. – MikeB May 12 '20 at 15:22
To find all the links, we will in this example use the urllib2 module together with the re module. One of the most powerful functions in the re module is re.findall(): while re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.
import urllib2
import re
#the URL of the page to scan (placeholder; any page will do)
url = "http://www.somewhere.com"
#connect to a URL
website = urllib2.urlopen(url)
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links
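As a small sketch of the difference between re.search() and re.findall() described above (the sample string is made up):
import re
sample = 'see "http://a.example" and "ftp://b.example/file"'
print re.search('"((http|ftp)s?://.*?)"', sample).group(1)  # first match only
print [m[0] for m in re.findall('"((http|ftp)s?://.*?)"', sample)]  # every match, as a list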

Why not use regular expressions:
import urllib2
import re
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))

-
I'd love to be able to understand this; where can I efficiently find out what `re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)` means? thanks! – user1063287 Apr 06 '13 at 04:46
-
Why not use regular expressions to parse html: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?page=1&tab=votes#1732454 – allcaps Mar 18 '14 at 10:08
-
@user1063287, the web is full of regex tutorials. It's well worth your time to read a couple. While REs can get really convoluted, the one you're asking about is pretty basic. – alexis Jun 10 '16 at 23:20
Just for getting the links, without BeautifulSoup and regex:
import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind+len(tag):]
            end = item.index(endtag)
        except: pass
        else:
            print item[:end]
For more complex operations, of course, BSoup is still preferred.

-
Is there a way to filter out only some links with this? Like, say I only want links that have "Episode" in the link? – nwgat Apr 25 '17 at 00:18
BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).
import lxml.html
doc = lxml.html.parse(url)
links = doc.xpath('//a[@href]')
for link in links:
    print link.attrib['href']
The code above will return the links as is, and in most cases they will be relative links or absolute links from the site root. Since my use case was to extract only a certain type of link, below is a version that converts the links to full URLs and which optionally accepts a glob pattern like `*.mp3`. It won't handle single and double dots in the relative paths, though, but so far I didn't have the need for it. If you need to parse URL fragments containing `../` or `./` then urlparse.urljoin might come in handy.
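For instance, a small sketch of how urlparse.urljoin resolves such fragments (the URLs are placeholders, in the spirit of the fakedomain.mu examples below):
import urlparse  # Python 2; in Python 3 this lives in urllib.parse
base = 'http://fakedomain.mu/music/album/somepage.html'
print urlparse.urljoin(base, '../tracks/01.mp3')  # -> http://fakedomain.mu/music/tracks/01.mp3
print urlparse.urljoin(base, './02.mp3')          # -> http://fakedomain.mu/music/album/02.mp3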
NOTE: Direct lxml URL parsing doesn't handle loading from `https` and doesn't do redirects, so for this reason the version below uses `urllib2` + `lxml`.
#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch

try:
    import urltools as urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
    urltools = None

def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)

if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')
    for link in links:
        href = link.attrib['href']
        if fnmatch.fnmatch(href, glob_patt):
            # the comma after 'https://' matters: without it the two string
            # literals would silently concatenate into one
            if not href.startswith(('http://', 'https://', 'ftp://')):
                if href.startswith('/'):
                    href = host + href
                else:
                    parent_url = url.rsplit('/', 1)[0]
                    href = urlparse.urljoin(parent_url, href)
            if urltools:
                href = urltools.normalize(href)
            print href
The usage is as follows:
getlinks.py http://stackoverflow.com/a/37758066/191246
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"

-
`lxml` can only handle valid input; how can it replace `BeautifulSoup`? – alexis Jun 10 '16 at 22:41
-
@alexis: I think `lxml.html` is a bit more lenient than `lxml.etree`. If your input is not well-formed then you can explicitly set the BeautifulSoup parser: http://lxml.de/elementsoup.html. And if you do go with BeautifulSoup then BS3 is a better choice. – ccpizza Jun 10 '16 at 22:48
Here's an example using the accepted answer by @ars and the `BeautifulSoup4`, `requests`, and `wget` modules to handle the downloads.
import requests
import wget
import os
from bs4 import BeautifulSoup, SoupStrainer
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'
response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)

I found the answer by @Blairg23 working, after the following correction (covering the scenario where it failed to work correctly):
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
            wget.download(full_path)
For Python 3, `urllib.parse.urljoin` has to be used instead in order to obtain the full URL.
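A minimal Python 3 sketch of that (the base URL matches the example above; the file names are made up):
from urllib.parse import urljoin
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
print(urljoin(url, 'some_file.tar.gz'))       # relative name resolves against the base URL
print(urljoin(url, '/ml/other/file.tar.gz'))  # a leading slash resolves from the site root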

- 4,302
- 5
- 38
- 50

- 21
- 3
There can be many duplicate links together with both external and internal links. To differentiate between the two and get just the unique links using sets:
# Python 3.
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)

# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

external_links = set()
internal_links = set()

for line in soup.find_all('a'):
    link = line.get('href')
    if not link:
        continue
    if link.startswith('http'):
        external_links.add(link)
    else:
        internal_links.add(link)

# Depending on usage, full internal links may be preferred.
full_internal_links = {
    urllib.parse.urljoin(url, internal_link)
    for internal_link in internal_links
}

# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
    print(link)

import urllib2
from bs4 import BeautifulSoup

a = urllib2.urlopen('http://dir.yahoo.com')
code = a.read()
soup = BeautifulSoup(code)
links = soup.findAll("a")

#To get href part alone
print links[0].attrs['href']
