2

I'm trying to use Python and urllib to look at the source code of a certain web page. This has worked for me on other pages using the following code:

from urllib import urlopen
url = 
code = urlopen(url).read()
print code

But for this page it returns nothing at all. My guess is that it's because the page uses a lot of JavaScript. What can I do?

Traveling Tech Guy
Alex T

1 Answer

3

Dynamic client side generated pages (JavaScript)

You cannot use urllib alone to see content that is rendered dynamically on the client side (JavaScript). The reason is that urllib only fetches the response from the server, which consists of the headers and the body (the actual code). It does not execute any client-side code, so anything the page's scripts would generate is missing from what you get back.

You can, however, use something like selenium to remote-control a web browser (Chrome or Firefox). That makes it possible to scrape the page even though it is rendered with JavaScript.

Here is a sample of scraping with selenium: Using python with selenium to scrape dynamic web pages
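As a rough illustration of the selenium approach, here is a minimal sketch. It assumes the `selenium` package plus Firefox and its driver are installed; the function name `fetch_rendered_html` and the fixed pause are illustrative choices of mine, not something taken from the linked post.

```python
import time


def fetch_rendered_html(url, wait_seconds=2.0):
    """Load url in a real browser and return the DOM after scripts have run."""
    # Imported here so this module still loads where selenium is not installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        # Crude fixed pause to give the page's scripts time to run (an
        # assumption; waiting for a specific element is usually more reliable).
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading failed
```

Calling `fetch_rendered_html(url)` should return the HTML as the browser sees it after JavaScript has run, at the cost of starting a full browser for every call.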

But that is not your problem here

The problem with this site, however, seems to be that they don't want to be scraped. They block clients that send certain HTTP User-Agent headers, and urllib identifies itself as a Python client by default.

You can get the code anyway if you fake the HTTP headers. Use urllib2 instead of urllib, like this:

import urllib2

# `url` is the address of the page, as in the question
req = urllib2.Request(url)
# Send a browser-like User-Agent instead of the default Python one
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox')
response = urllib2.urlopen(req)
print response.read()
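For readers on Python 3, where urllib2 was folded into `urllib.request`, the same idea might look like the sketch below. The helper names `build_request` and `fetch` are hypothetical, and the User-Agent string is simply the one used above; whether a given site accepts it is an assumption.

```python
import urllib.request

# Same browser-like string as in the snippet above; other realistic values may work too.
FAKE_USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) "
    "Gecko/20091102 Firefox"
)


def build_request(url, user_agent=FAKE_USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch url and return the decoded body (performs a network call)."""
    with urllib.request.urlopen(build_request(url)) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Splitting out `build_request` keeps the header logic separate from the network call, so it can be inspected without contacting the site.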

That said, they clearly don't want their site scraped, so you should consider whether this is a good idea.

Community
Niclas Nilsson
  • 1
    You have two options. (1) Use selenium to scrape it; I don't think that would be very hard, but it's not very efficient because you need a browser running. (2) Use a regex to extract the JavaScript variables and then try to interpret them in Python (maybe with the json module). Unfortunately, that will be a lot more work for you. – Niclas Nilsson Jun 17 '13 at 07:01
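Option (2) from the comment could be sketched roughly as follows. This assumes the page embeds its data as `var <name> = <literal>;` and that the literal happens to be valid JSON (double-quoted strings, no trailing commas); the function name `extract_js_variable` and the variable name `items` are made up for illustration.

```python
import json
import re


def extract_js_variable(html, name):
    """Return the value assigned to `var <name> = ...` in html, or None if absent."""
    # Find the start of the assignment; the trailing \s* skips any whitespace
    # so the match ends exactly where the literal begins.
    match = re.search(r"\bvar\s+" + re.escape(name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and ignores
    # whatever follows it (the `;`, the rest of the script, and so on).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


sample = '<script>\nvar items = [{"id": 1, "name": "a;b"}];\n</script>'
print(extract_js_variable(sample, "items"))
```

If the literal is JavaScript but not JSON (single quotes, unquoted keys), `raw_decode` raises `ValueError`, which is where the extra work mentioned in the comment comes in.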