
I know how to request a web page and read its text with Python. In the past, I've tried using a library like BeautifulSoup to follow all of the links on a site, but that misses anything that doesn't look like a full URL, such as AJAX requests and most requests to the original domain (the "http://example.com" prefix will be missing and, more importantly, the URL isn't in an <a href='url'>Link</a> format, so BeautifulSoup will miss it).

How can I load all of a site's resources in Python? Will it require interacting with something like Selenium, or is there a way that's not too difficult to implement without that? I haven't used Selenium much, so I'm not sure how difficult that will be.

Thanks

rofls
  • Do you want to download the complete web page, as the browser does after hitting `Ctrl/Command + S` (the Save menu item)? – alecxe Aug 10 '14 at 22:59
  • That would do the trick! I think I may have just found it using the below (happy to hear other options though, and I wish I had the link that I found this from...): `import urllib2 url = 'http://example.com' headers = {'User-Agent' : 'Mozilla/5.0'} request = urllib2.Request(url,None,headers) sock = urllib2.urlopen(request) ch = sock.read() sock.close()` – rofls Aug 10 '14 at 23:04
  • Sorry, that didn't come out how I had hoped. No new lines. – rofls Aug 10 '14 at 23:05
  • Sweet, that'd be nice to see how to do it with Selenium. Do you mean with one of the client programming libraries? I'm personally more interested in the client coding libraries than the IDE/macro creator. – rofls Aug 10 '14 at 23:13
  • Sure, python+selenium, as you've tagged. – alecxe Aug 10 '14 at 23:14
  • Durr, yup :) That's nice to know that you're pretty sure it's doable with Selenium. I think that could definitely be a useful tool for me down the road. What do you use it for, strictly QA, or have you found it useful for other purposes? – rofls Aug 10 '14 at 23:17
  • Yeah, it is almost doable. I managed to fire up the "Save as" dialog using Firefox, but had no luck making it save a complete web page with all files automatically: you would have to click "Save" in the "Save as" dialog manually (or with a tool like AutoIt). So, if this is OK for you, I can post the solution. Thanks. – alecxe Aug 11 '14 at 03:32
  • Oops, no need to post a solution; here's basically it: http://stackoverflow.com/questions/14516590/how-to-save-complete-webpage-not-just-basic-html-using-python. – alecxe Aug 11 '14 at 03:34

3 Answers

It all depends on what you want and how you want it. The closest thing that may work for you is Ghost.py:

from ghost import Ghost

ghost = Ghost()
# open() returns the page itself plus the extra resources it loaded
page, extra_resources = ghost.open("http://jeanphi.fr")
assert page.http_status == 200 and 'jeanphix' in ghost.content

You can read more at: http://jeanphix.me/Ghost.py/

iChux

I would love to hear other ways of doing this, especially if they're more concise (easier to remember), but I think this accomplishes my goal. It does not fully answer my original question, though: it just gets more of the content than requests.get(url) does, which was enough for me in this case:

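import urllib2  # Python 2 only

url = 'http://example.com'
# Send a browser-like User-Agent; some sites respond differently to the default one
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib2.Request(url, None, headers)
sock = urllib2.urlopen(request)
ch = sock.read()  # raw HTML of the page itself
sock.close()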
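
For anyone on Python 3, where `urllib2` became `urllib.request`, the same request might look roughly like the sketch below. The helper names `build_request` and `fetch` are made up for illustration, and the URL and User-Agent string are the same placeholders as above. Like the original, it only fetches the HTML of the page itself, not the resources that page references.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that sends a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the raw bytes of a single URL (no sub-resources)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


if __name__ == "__main__":
    html = fetch("http://example.com")
    print(len(html), "bytes")
```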
rofls

Mmm, that's a pretty interesting question. For resources whose URLs can't be identified from the HTML because they are generated at runtime (such as those used in scripts, not only AJAX), you'd need to actually run the website, so that scripts get executed and dynamic URLs get created.

One option is something like what this answer describes, which uses a third-party library, like Qt, to actually run the website. To collect all URLs, you need some way of monitoring all requests made by the website; that could be done like this (it's C++, but the code is essentially the same).
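
Since the question mentions Selenium, here is a rough sketch of the same "run the page, then see what it requested" idea with a Selenium WebDriver instead of Qt. It is not the Qt approach linked above: it asks the browser's Resource Timing API for the resources loaded so far. The function name `collect_resource_urls` is made up for illustration, and the sketch assumes Selenium and a Firefox driver are installed. It only sees requests made before the script runs, and browsers cap how many entries they keep, so treat it as a starting point.

```python
def collect_resource_urls(driver, url):
    """Load url in a real browser and return the URLs it requested.

    `driver` is any Selenium WebDriver. The Resource Timing API lists
    the sub-resources (scripts, images, XHR/fetch calls, ...) that the
    page has loaded so far.
    """
    driver.get(url)
    names = driver.execute_script(
        "return performance.getEntriesByType('resource').map(e => e.name);"
    )
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(names))


if __name__ == "__main__":
    # Assumption: Selenium and a Firefox driver are available locally.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        for resource_url in collect_resource_urls(driver, "http://example.com"):
            print(resource_url)
    finally:
        driver.quit()
```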

Finally, once you have the URLs, you can use something like Requests to download the external resources.
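
As a rough sketch of that last step for the statically declared resources: the code below collects every `href`/`src` attribute, resolves relative URLs against the page URL with `urljoin` (which covers the "missing http://example.com" case from the question), and downloads each one. It uses the standard library's `urllib.request` rather than Requests so that it has no dependencies. The names `ResourceParser`, `find_resources` and `download` are made up for illustration, and URLs created by scripts at runtime will not be found this way.

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceParser(HTMLParser):
    """Collect URLs from the href/src attributes of any tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # urljoin turns "/static/app.js" into a full URL.
                absolute = urljoin(self.base_url, value)
                if absolute not in self.urls:
                    self.urls.append(absolute)


def find_resources(html, base_url):
    """Return absolute URLs referenced by href/src, in document order."""
    parser = ResourceParser(base_url)
    parser.feed(html)
    return parser.urls


def download(url, directory="."):
    """Save one resource to `directory` and return the local path."""
    name = os.path.basename(urlparse(url).path) or "index.html"
    path = os.path.join(directory, name)
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        out.write(response.read())
    return path
```

Calling `find_resources(html, url)` on the fetched page and then `download(u)` for each result would save the referenced files. Different URLs that share a file name will overwrite each other in this sketch.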

José Tomás Tocino