Why does text retrieved from pages sometimes look like gibberish?

Question

I'm using urllib and urllib2 in Python to open and read webpages, but sometimes the text I get is unreadable. For example, if I run this:

import urllib

text = urllib.urlopen('http://tagger.steve.museum/steve/object/141913').read()
print text

I get some unreadable text. I've read these posts:

Gibberish from urlopen

Does python urllib2 automatically uncompress gzip data fetched from webpage?

but can't seem to find my answer.
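For reference, the second linked post concerns gzip-compressed responses, which are one common reason a body read with urllib looks unreadable (the answers below point to a different cause for this particular URL). The following is only a minimal sketch, assuming Python 3; `decode_body` is a hypothetical helper name, not part of urllib, and it only undoes the encoding the server declares in its Content-Encoding header.

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None):
    """Return the body bytes, undoing gzip or deflate if the header declares it.

    `decode_body` is a hypothetical helper for illustration only.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Most servers send a zlib-wrapped stream for "deflate".
            return zlib.decompress(raw)
        except zlib.error:
            # Some send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No declared compression: hand the bytes back unchanged.
    return raw
```

With a response object `resp` from `urllib.request.urlopen`, this could be called as `decode_body(resp.read(), resp.headers.get("Content-Encoding"))`.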

Thank you in advance for your help!

UPDATE: I fixed the problem by 'convincing' the server that my user-agent is a browser and not a crawler.

import urllib

class NewOpener(urllib.FancyURLopener):
  version = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.2 (KHTML, like Gecko) Ubuntu/11.10 Chromium/15.0.874.120 Chrome/15.0.874.120 Safari/535.2'

nop = NewOpener()
html_text = nop.open('http://tagger.steve.museum/steve/object/141913').read()

Thank you all for your replies.
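The update above relies on urllib.FancyURLopener from Python 2, which is deprecated in Python 3. A rough Python 3 equivalent of the same idea, sending a browser-like User-Agent header, might look like the sketch below. The names `BROWSER_USER_AGENT`, `build_browser_request`, and `fetch` are hypothetical; the header value is the one from the update, and whether a given server accepts it is an assumption.

```python
import urllib.request

# Browser-like User-Agent string taken from the update above; any similar
# string could be substituted.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.2 (KHTML, like Gecko) "
    "Ubuntu/11.10 Chromium/15.0.874.120 Chrome/15.0.874.120 Safari/535.2"
)


def build_browser_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_USER_AGENT, timeout=10):
    """Fetch the URL and return the raw response bytes."""
    request = build_browser_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling `fetch('http://tagger.steve.museum/steve/object/141913')` would then play the role of `nop.open(...).read()` in the update; the result is bytes and still needs decoding to text.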

The result of urlopen(yourUrl) is JavaScript. Is this script really the content you want, or would you like to get the actual content of the web page (what a browser shows)? – Sébastien, Nov 25 '11 at 16:06

score 2 · Answer 1 · answered Nov 25 '11 at 16:09

This gibberish is the actual server response to a request for 'http://tagger.steve.museum/steve/object/141913'. It looks like obfuscated JavaScript which, when executed by a browser, loads the page content.

To get that content you need to execute the JavaScript, which can be a really difficult task from within Python. If you still want to do this, take a look at pywebkitgtk.
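Before reaching for a browser engine, it can help to confirm that a response really is mostly script. The sketch below is a rough heuristic, not a reliable detector: it assumes Python 3, assumes the script arrives inside `<script>` tags in an HTML document, and uses a hypothetical helper name (`looks_script_generated`) and an arbitrary threshold.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Count characters inside <script> elements versus all other text."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not skew the ratio.
        length = len(data.strip())
        if self._in_script:
            self.script_chars += length
        else:
            self.text_chars += length


def looks_script_generated(html, threshold=0.8):
    """Return True if at least `threshold` of the text content is script.

    Hypothetical heuristic: the 0.8 default is an arbitrary assumption.
    """
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    total = counter.script_chars + counter.text_chars
    if total == 0:
        return False
    return counter.script_chars / total >= threshold
```

If this returns True for a page, fetching it with urllib alone is unlikely to yield the text a browser displays.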

score 1 · Accepted Answer · answered Nov 25 '11 at 16:59

You can use Selenium to get the content. Download the Selenium server and the Python client driver, start the server, and then run this:

from selenium import selenium
s = selenium("localhost", 4444, "*chrome", "http://tagger.steve.museum")
s.start()

s.open("/steve/object/141913")

text = s.get_html_source()
print text
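The snippet above uses the older Selenium Remote Control API. Newer Selenium releases expose the WebDriver API instead, so a roughly equivalent sketch might look like the following. It assumes Python 3, an installed `selenium` package, and a local Firefox with its driver; `fetch_rendered_html` and `_default_driver_factory` are hypothetical names, and the driver factory is a parameter only so the browser can be swapped out.

```python
def _default_driver_factory():
    # Imported lazily so this code can be loaded without Selenium installed.
    from selenium import webdriver

    # Assumption: Firefox and its driver are available on this machine.
    return webdriver.Firefox()


def fetch_rendered_html(url, driver_factory=_default_driver_factory):
    """Load the URL in a real browser and return the page source it reports."""
    driver = driver_factory()
    try:
        driver.get(url)
        # page_source reflects the DOM at this moment; content loaded later by
        # asynchronous scripts may need an explicit wait before reading it.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()
```

For the page in the question, this would be called as `fetch_rendered_html('http://tagger.steve.museum/steve/object/141913')`.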

– jan zegan Nov 25 '11 at 16:59

Thanks joshz! It turns out I will need Selenium, because the JavaScript has to be executed before I can see the page source the way it appears in the browser. One quick question: if I run the above script in the interactive Python interpreter, it works great, but if I store it in a file and run it all together, it reports syntax errors! Do you have any idea what might be causing this? – Siato Nov 25 '11 at 23:49
Not really, without knowing what the error is. I ran it from a file with Python 2.7.2. My best guess is that the file is being run with a different Python version. – jan zegan Nov 26 '11 at 00:09
It magically fixed itself! I have no idea what was causing the problem! Thanks for your suggestions! – Siato Nov 26 '11 at 00:32
