Server errors while web scraping https in python

Question

I'm trying to scrape this site https://propaccess.trueautomation.com/ClientDB/Property.aspx?prop_id=17471

I can type the address directly into my URL bar and get the results I want, but when I scrape it in Python, I only get the source code for a "runtime error" page.

I'm thinking it might have something to do with https because I can scrape pages in the clear like craigslist.

My code is as follows,

import urllib
import re

domain = "https://propaccess.trueautomation.com/ClientDB/Property.aspx?prop_id=17471"


htmlfile = urllib.urlopen(domain)
htmltext = htmlfile.read()
print htmltext

I'm new to Python, but not to the internet. I assumed that if I could type the URL into the browser with success, I'd be able to use the same URL in Python. That seems not to be the case, and I don't have a clue why.

Thanks. Mike

Update: If I browse to said url in a browser I have never used to surf this page, I get the "runtime error" page.

This appears to be an issue with the website you are trying to scrape. As such, I don't think it is really on topic here. — cdhowie, Jul 28 '14 at 22:02
I don't believe it's an issue on the page, I'm thinking it has something to do with https, maybe the certificate. I'm not sure because I'm very new to scraping, and this is my first attempt at scraping a https site. — Mike82, Jul 28 '14 at 22:13
It's more likely to be due to the absence of a cookie, though it would be better if the application would give you a nice error page indicating what went wrong. Either way, the problem is with the web application. (If it was a certificate problem you wouldn't even have gotten this far -- you'd get a connection failure.) — cdhowie, Jul 28 '14 at 22:14
That would make sense, as I could initially not view the page in a different browser, until I went back to the main menu on that page. I'm guessing it set a cookie, and now I can view it all I want. So now the question is, what piece of information am I missing as far as my python code goes? What topic do I need to research? Thanks for the help btw. — Mike82, Jul 28 '14 at 22:22
Basically you need to figure out what cookie is being set. Chrome's built in developer panel could help here (press F12, use the network tab); other browsers have similar extensions. Then you will need to [send that cookie in your Python code](http://stackoverflow.com/q/3334809/501250). Good luck! — cdhowie, Jul 28 '14 at 22:24
So to sum up, the server wants to issue a cookie, but my script is unable to handle cookies? — Mike82, Jul 28 '14 at 22:28
Well, your script is unable to handle cookies, but you need to visit another page *first* (the one that sets the cookie) before visiting this page. That's why you get the error in a new browser session. [This question](http://stackoverflow.com/q/5825957/501250) shows a mechanism you can use to have cookies handled automatically for you -- then all you would have to do is first fetch whichever page sets the cookie, then fetch the page you are after. — cdhowie, Jul 28 '14 at 22:31
This question appears to be off-topic because it is about what response the remote application produces, not about your code. — Celada, Jul 29 '14 at 03:46
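
The approach described in the comments above (fetch the page that sets the cookie first, then fetch the property page with the same cookie jar) might look roughly like this. It is only a sketch using the Python 3 standard library, not the asker's Python 2 `urllib`. The function names are made up for illustration, and `ENTRY_URL` is a guess: the thread never says which page sets the cookie, so the real one has to be found in the browser's network tab.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar); the opener stores cookies in jar and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_after_visiting(opener, entry_url, target_url):
    """Request entry_url first so the server can set its cookie, then fetch target_url."""
    with opener.open(entry_url) as entry_response:
        entry_response.read()  # body is discarded; only the Set-Cookie headers matter
    with opener.open(target_url) as target_response:
        return target_response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # ASSUMPTION: hypothetical entry page; replace with the page that really sets the cookie.
    ENTRY_URL = "https://propaccess.trueautomation.com/ClientDB/"
    TARGET_URL = "https://propaccess.trueautomation.com/ClientDB/Property.aspx?prop_id=17471"

    opener, jar = make_cookie_opener()
    print(fetch_after_visiting(opener, ENTRY_URL, TARGET_URL))
```

If the second request still returns the "runtime error" page, printing `list(jar)` after the first request shows whether any cookie was actually set.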

score 0 · Answer 1 · answered Jul 28 '14 at 22:13

I cannot access the page you linked. It looks like your browser is in an authenticated session, and your Python code knows nothing about that session. The server will therefore return a "permission denied" page or something similar.

If so, you probably want to pass the session cookie with your request. The Requests library should be able to do what you need.

(http://docs.python-requests.org/en/latest/user/advanced/#session-objects)
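
As a rough illustration of the session objects linked above, the sketch below uses `requests.Session`, which keeps cookies between requests. It assumes the Requests library is installed and that visiting some entry page is enough to obtain the cookie. `fetch_with_session` is a name invented for this example, and `ENTRY_URL` is a hypothetical placeholder, since the page that sets the cookie is not identified here.

```python
import requests


def fetch_with_session(session, entry_url, target_url, timeout=30):
    """Visit entry_url so the session picks up cookies, then return target_url's HTML."""
    session.get(entry_url, timeout=timeout).raise_for_status()
    response = session.get(target_url, timeout=timeout)
    response.raise_for_status()  # raises on 4xx/5xx status codes
    return response.text


if __name__ == "__main__":
    # ASSUMPTION: hypothetical entry page; replace with the page that really sets the cookie.
    ENTRY_URL = "https://propaccess.trueautomation.com/ClientDB/"
    TARGET_URL = "https://propaccess.trueautomation.com/ClientDB/Property.aspx?prop_id=17471"

    with requests.Session() as session:
        print(fetch_with_session(session, ENTRY_URL, TARGET_URL))
```

Both requests go through the same `Session`, so any cookie set by the first response is sent automatically with the second.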

Hint: when you are scraping, open the page in an incognito window first. With no stored cookies, the browser should see roughly what your Python code sees.
