
If I use urllib to load this URL (https://www.fundingcircle.com/my-account/sell-my-loans/), I get a 400 status error.

For example, the following returns a 400 error:

>>> import urllib
>>> f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
>>> print f.read()

However, if I copy and paste the url into my browser, I see a web page with the information that I want to see.

I have tried wrapping the call in a try/except block and reading the error, but the returned data just tells me that the page does not exist. For example:

import urllib
try:
    f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
except Exception as e:
    eString = e.read()
    print eString

Why can't Python load the page?

Ginger
  • Do you have a permission from fundingcircle.com to scrape their website? – Tymoteusz Paul Oct 27 '14 at 22:13
  • Puciek, according to their T&C's, they are aware that people are doing it, but they don't have a policy on it yet. – Ginger Oct 27 '14 at 22:14
  • You are not logged into the site + you may change the user agent and other headers to imitate a browser. You might need to maintain a state between the calls : http://stackoverflow.com/questions/4414683/how-can-i-log-into-a-website-using-python – Avia Oct 27 '14 at 22:18
  • Try spoofing some of the headers. – anon582847382 Oct 27 '14 at 22:22
  • @Puciek whether or not code violates another site's Terms of Service has been done to death, and the position is that StackOverflow has no responsibility over what askers do with their code. While they might not suit your ethics, as long as they are researched and have example code, questions like "Why isn't this missile code correctly targeting civilians?" or "How can I circumvent this rate limiting?" are on-topic. –  Oct 28 '14 at 23:17
  • @LegoStormtroopr and where did I say that this is NOT on topic? All I have said is that since he didn't ask, he receives a downvote from me, which I am allowed to do based on my ethics. – Tymoteusz Paul Oct 28 '14 at 23:19

1 Answer


If Python is given a 404 status then that'd be because the server refuses to give you the page.

Why it refuses can be hard to determine, because servers are black boxes. But your browser sends the server more than just the URL; it also sends a set of HTTP headers. Most likely the server alters its behaviour based on the contents of one or more of those headers.

You need to look in your browser's developer tools to see what your browser sends, then try to replicate some of those headers from Python. Obvious candidates are the User-Agent header, followed by the Accept and Cookie headers.
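
As a rough sketch of what replicating headers could look like, assuming Python 3's urllib.request rather than the Python 2 urllib used in the question: the helper name build_browser_like_request and the header values below are illustrative placeholders, not values known to satisfy this server. Copy the real ones from your browser's network tab.

```python
import urllib.request

# Illustrative values only; copy the real ones from your browser's network tab.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
}


def build_browser_like_request(url, extra_headers=None):
    """Return a Request carrying browser-style headers (hypothetical helper)."""
    headers = dict(BROWSER_HEADERS)
    if extra_headers:
        # For example {"Cookie": "..."} copied from a logged-in browser session.
        headers.update(extra_headers)
    return urllib.request.Request(url, headers=headers)


request = build_browser_like_request(
    "https://www.fundingcircle.com/my-account/sell-my-loans/"
)
# Building the request does not touch the network; sending it is a separate
# step: urllib.request.urlopen(request)
```

Keeping the headers in one dictionary makes it easy to add or drop one at a time and see which of them the server actually reacts to.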

However, in this specific case, the server is responding with a 401 Unauthorized; you are given a login page. It does this both for the browser and Python:

>>> import urllib
>>> urllib.urlopen('https://www.fundingcircle.com/my-account/sell-my-loans/')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 451, in open_https
    return self.http_error(url, fp, errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 372, in http_error
    result = method(url, fp, errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 683, in http_error_401
    errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 381, in http_error_default
    raise IOError, ('http error', errcode, errmsg, headers)
IOError: ('http error', 401, 'Unauthorized', <httplib.HTTPMessage instance at 0x1066f9a28>)

but Python's urllib cannot deal with this 401 response (the traceback shows http_error_401 falling through to http_error_default) and turns it into an exception.
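
To see what the server actually sent, the exception can be inspected rather than just printed. The following is a minimal sketch assuming Python 3, where urllib.request.urlopen raises urllib.error.HTTPError for error statuses; that error object also behaves like a response, so the status code and body stay readable. The function names fetch and describe_http_error are made up for this example.

```python
import urllib.error
import urllib.request


def describe_http_error(err):
    """Pull the status code, reason and body out of an HTTPError."""
    body = err.read().decode("utf-8", errors="replace")
    return err.code, err.reason, body


def fetch(url):
    """Return (status, reason, body), whether or not the server reports an error."""
    try:
        with urllib.request.urlopen(url) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, response.reason, body
    except urllib.error.HTTPError as err:
        # The error object is also a response: it still carries the page body.
        return describe_http_error(err)
```

Calling fetch() on the URL from the question should then hand back the status code together with the HTML the server returned, instead of only a traceback.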

The response body contains a login form; you'll have to write code to log in here, and presumably track cookies.
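
The two pieces that takes, reading the form fields and keeping cookies between requests, can be sketched with the standard library alone. This assumes Python 3; the field names username and password are hypothetical defaults, since the real names have to be read from the site's login form, and all function names here are invented for the example.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect name/value pairs from every <input> tag on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if attributes.get("name"):
                self.fields[attributes["name"]] = attributes.get("value") or ""


def extract_form_fields(html):
    """Return the input fields found in the HTML, hidden tokens included."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.fields


def make_session():
    """Return an opener that stores cookies and sends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_login_form(fields, username, password,
                      user_field="username", password_field="password"):
    """URL-encode the form for a POST; the field names are hypothetical defaults."""
    data = dict(fields)
    data[user_field] = username
    data[password_field] = password
    return urllib.parse.urlencode(data).encode("ascii")
```

With these in place, the login itself would be roughly opener.open(login_url, data=encode_login_form(fields, user, password)), where login_url stands for the form's action URL, followed by opener.open() on the page you want. Both requests go through the same opener so that the session cookie set at login is sent with the second one.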

That task would be a lot easier with more specialised tools. You could use robobrowser to load the page, parse the form and give you the tools to fill it out, then post the form for you and track the cookies required to keep you logged in. It is built on top of the excellent requests and BeautifulSoup libraries.

Martijn Pieters