
I wrote a program to extract all links to PDF files in a web page. The program works perfectly, with no errors, on some websites, for example:

Hussam# python extractPDF.py http://www.cs.odu.edu/~mln/teaching/cs532-s17/test/pdfs.html

Output:

Entered URL:
http://www.cs.odu.edu/~mln/teaching/cs532-s17/test/pdfs.html
Final URL:
http://www.cs.odu.edu/~mln/teaching/cs532-s17/test/pdfs.html
http://www.cs.odu.edu/~mln/pubs/ht-2015/hypertext-2015-temporal-violations.pdf
Size: 2184076
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-annotations.pdf
Size: 622981
http://arxiv.org/pdf/1512.06195
Size: 1748961
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-off-topic.pdf
Size: 4308768
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-stories.pdf
Size: 1274604
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-profiling.pdf
Size: 639001
http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-brunelle-damage.pdf
Size: 2205546
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-mink.pdf
Size: 1254605
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-arabic-sites.pdf
Size: 709420
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-dictionary.pdf
Size: 2350603

On the other hand, if I try this link:

Hussam# python extractPDF.py http://www.cs.odu.edu/~mln/pubs/all.html

I get the correct output, but it ends with an error:

Entered URL:
http://www.cs.odu.edu/~mln/pubs/all.html
Final URL:
http://www.cs.odu.edu/~mln/pubs/all.html
http://www.cs.odu.edu/~mln/pubs/tpdl-2016/tpdl-2016-kelly.pdf
Size: 953454
http://www.cs.odu.edu/~mln/pubs/tpdl-2016/tpdl-2016-alam.pdf
Size: 928749
http://www.cs.odu.edu/~mln/pubs/jcdl-2016/jcdl-2016-alam-ipfs.pdf
Size: 516538
http://www.cs.odu.edu/~mln/pubs/jcdl-2016/jcdl-2016-alam-memgator.pdf
Size: 345028
http://www.cs.odu.edu/~mln/pubs/jcdl-2016/jcdl-2016-nwala.pdf
Size: 640173
http://www.cs.odu.edu/~mln/pubs/ht-2015/hypertext-2015-temporal-violations.pdf
Size: 2184076
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-annotations.pdf
Size: 622981
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-off-topic.pdf
Size: 4308768
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-stories.pdf
Size: 1274604
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-profiling.pdf
Size: 639001
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-temporal-intention.pdf
Size: 720476
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-mink.pdf
Size: 1254605
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-arabic-sites.pdf
Size: 709420
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-dictionary.pdf
Size: 2350603
http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-kelly-acid.pdf
Size: 541843
http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-kelly-mink.pdf
Size: 556863
http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-brunelle-damage.pdf
Size: 2205546
http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-cartledge-copies.pdf
Size: 1199511
http://www.cs.odu.edu/~mln/pubs/sigcse-2014/web-science-sigcse-2014.pdf
Size: 158242
http://www.cs.odu.edu/~mln/pubs/ecir-2014/ecir-2014.pdf
Size: 902825
http://www.cs.odu.edu/~mln/pubs/ieee-vis-2013/2013-ieee-vis-boxoffice.pdf
Size: 122738
Traceback (most recent call last):
  File "extractPDF.py", line 21, in <module>
    r = urllib2.urlopen(link)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 605, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python2.7/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

Here is the code for the program:

import sys
import re
import urllib2
from bs4 import BeautifulSoup

if len(sys.argv) != 2:
    print "USAGE:"
    print "python extractPDF.py http://example.com/page.html"
else:
    url = sys.argv[1]
    print "Entered URL:"
    print url
    html_page = urllib2.urlopen(url)
    print "Final URL:"
    print html_page.geturl()
    soup = BeautifulSoup(html_page, "html.parser")
    # Collect every absolute http:// link on the page
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    # Fetch each link and report the ones that serve a PDF
    for link in links:
        r = urllib2.urlopen(link)
        if r.headers['content-type'] == "application/pdf":
            print link
            print "Size: " + r.headers['Content-Length']
  • 403 Forbidden. You're not allowed to access whatever the value is in `req.get_full_url()`. What don't you understand? – Wayne Werner Jan 24 '17 at 19:44
  • The issue is with the URL that you are accessing, as pointed out by @WayneWerner. Try putting a check for an **HTTPError** exception. Check this [link](http://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden) for more info. – Harshdeep Sokhey Jan 24 '17 at 19:57
  • Is your crawler blocked because it's been detected? – mmenschig Jan 24 '17 at 20:59
  • Okay, thanks. My crawler is not blocked; I just forgot to write an exception handler for HTTPError. – Hussam Hallak Jan 24 '17 at 21:26

1 Answer

urllib2.HTTPError: HTTP Error 403: Forbidden

Your code fetches every link on the page. At least one of those links (not necessarily a link to a PDF) is not available to you. 403 Forbidden means "The server understood the request but refuses to authorize it." The URL probably requires credentials that permit you access.

urllib2 raises exceptions for error conditions. Your code will need to handle some of them.

If you just want your code to continue without dying, replace the relevant section with:

    for link in links:
        r = None
        try:
            r = urllib2.urlopen(link)
        except urllib2.HTTPError as e:
            print link
            print "Error: " + e.code + " " + e.reason
            continue

        if r.headers['content-type'] == "application/pdf":
            print link
            print "Size: " + r.headers['Content-Length']
Ouroborus