How can I load all of a site's resources, including AJAX requests, etc., in Python?

Question

I know how to request a website and read its text with Python. In the past, I've tried using a library like BeautifulSoup to make requests to all of the links on a site, but that misses things that don't look like full URLs, such as AJAX requests and most requests to the original domain (since the "http://example.com" prefix will be missing and, more importantly, the URL isn't in an <a href='url'>Link</a> format, so BeautifulSoup will miss it).

How can I load all of a site's resources in Python? Will it require interacting with something like Selenium, or is there a way that's not too difficult to implement without that? I haven't used Selenium much, so I'm not sure how difficult that will be.

Thanks

Do you want to download the complete web page, as the browser does after hitting `Ctrl/Command + S` (the Save menu item)? — alecxe, Aug 10 '14 at 22:59
That would do the trick! I think I may have just found it using the below (happy to hear other options though, and I wish I had the link that I found this from...): `import urllib2 url = 'http://example.com' headers = {'User-Agent' : 'Mozilla/5.0'} request = urllib2.Request(url,None,headers) sock = urllib2.urlopen(request) ch = sock.read() sock.close()` — rofls, Aug 10 '14 at 23:04
Sweet, that'd be nice to see how to do it with Selenium. Do you mean with one of the client programming libraries? I'm personally more interested in the client coding libraries than the IDE/macro creator. — rofls, Aug 10 '14 at 23:13
Durr, yup :) That's nice to know that you're pretty sure it's doable with Selenium. I think that could definitely be a useful tool for me down the road. What do you use it for, strictly QA, or have you found it useful for other purposes? — rofls, Aug 10 '14 at 23:17
Yeah, it is almost doable. I managed to fire up the "Save as" dialog using Firefox, but had no luck making it save a complete web page with all files automatically: you would have to click "Save" in the "Save as" dialog manually (or using tools like AutoIt). So, if this is OK for you, I can post the solution. Thanks. — alecxe, Aug 11 '14 at 03:32
Oops, no need for posting a solution: here's basically it: http://stackoverflow.com/questions/14516590/how-to-save-complete-webpage-not-just-basic-html-using-python. — alecxe, Aug 11 '14 at 03:34

score 2 · Answer 1 · answered Aug 11 '14 at 08:34

It all depends on what you want and how you want it. The closest thing that may work for you is Ghost.py:

from ghost import Ghost
ghost = Ghost()
page, extra_resources = ghost.open("http://jeanphi.fr")
assert page.http_status==200 and 'jeanphix' in ghost.content

You can learn more at: http://jeanphix.me/Ghost.py/

rofls · Answer 2 · 2015-12-30T18:18:47.647

I would love to hear other ways of doing this, especially if they're more concise (easier to remember), but I think this accomplishes my goal. It does not fully answer my original question, though: it just gets more of the content than requests.get(url) does, which was enough for me in this case:

import urllib2
url = 'http://example.com'
headers = {'User-Agent' : 'Mozilla/5.0'}
request = urllib2.Request(url,None,headers)
sock = urllib2.urlopen(request)
ch = sock.read()
sock.close()
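
Note that urllib2 exists only in Python 2; in Python 3 the same functionality lives in urllib.request. A minimal sketch of the equivalent request follows. The helper names build_request and fetch are hypothetical and chosen only for illustration, and like the snippet above it fetches just the page body, not the resources the page loads afterwards.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Return the raw response body as bytes (performs a network request)."""
    # The context manager closes the connection, replacing sock.close().
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


# Example usage (needs network access):
# body = fetch("http://example.com")
```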

score 0 · Answer 3 · edited May 23 '17 at 11:44

Mmm, that's a pretty interesting question. For resources whose URLs can't be fully identified because they are generated at runtime (such as those used in scripts, not only AJAX), you'd need to actually run the website so that scripts get executed and dynamic URLs get created.

One option is something like what this answer describes: using a third-party library, like Qt, to actually run the website. To collect all URLs, you need some way of monitoring all requests made by the website, which could be done like this (it's C++, but the code is essentially the same).

Finally, once you have the URLs, you can use something like Requests to download the external resources.
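
As a rough sketch of that last step, assuming the monitoring stage has already produced a list of URLs (some of them relative to the original domain): the code below resolves them against the page URL, removes duplicates, and downloads each one. It uses the standard library's urllib in place of Requests so that it has no third-party dependency, and the function names resolve_urls and download_all are hypothetical.

```python
import urllib.request
from urllib.parse import urldefrag, urljoin


def resolve_urls(base_url, found_urls):
    """Turn relative or fragment-bearing URLs into unique absolute ones."""
    resolved = []
    seen = set()
    for raw in found_urls:
        # urljoin fills in the missing scheme and host for relative URLs;
        # urldefrag drops any "#fragment", which never reaches the server.
        absolute, _fragment = urldefrag(urljoin(base_url, raw))
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


def download_all(urls):
    """Fetch each URL and return {url: bytes} (performs network requests)."""
    results = {}
    for url in urls:
        with urllib.request.urlopen(url) as response:
            results[url] = response.read()
    return results
```

For example, resolve_urls("http://example.com/page/", ["/api/data.json", "style.css"]) yields absolute URLs that can be passed straight to download_all.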
