I can't get the whole source code of an HTML page

Question

Using Python, I want to crawl data from a web page whose source is quite big (it is a Facebook page of some user).

Say url holds the URL of the page I am trying to crawl. I run the following code:

import urllib2

usock = urllib2.urlopen(url)
data = usock.read()
usock.close()

The variable data is supposed to contain the source of the page I am crawling, but for some reason it is missing some of the characters I see when I view the page source directly in the browser. I don't know what I am doing wrong. I know that the page has not been updated recently, so the gap is not caused by very recent changes.
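
One way to narrow this down is to fetch the page and list which expected snippets are absent from the raw response. The following is only a sketch in Python 3 (where urllib2 became urllib.request); the helper names fetch_source and missing_markers and the marker strings are illustrative assumptions, not part of the original code.

```python
import urllib.request


def fetch_source(url, timeout=30):
    """Return the page source as text, exactly as the server sent it (no JavaScript is run)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def missing_markers(source, markers):
    """Return the markers (plain substrings) that do not occur in the source."""
    return [marker for marker in markers if marker not in source]
```

For example, missing_markers(fetch_source(url), ["profileInfoSection", "uiInfoTable"]) would return the class names that never reached the script. If they are missing here but visible in the browser, the content was most likely added after the initial response.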

Does someone have a clue?

EDIT: the kind of information I am missing is like:

<code class="hidden_elem" id="up82eq_33"><!-- <div class="mbs profileInfoSection"><div class="uiHeader uiHeaderTopAndBottomBorder uiHeaderSection infoSectionHeader"><div class="clearfix uiHeaderTop"><div><h4 tabindex="0" class="uiHeaderTitle">Basic Information</h4></div></div></div><div class="phs"><table class="uiInfoTable mtm profileInfoTable uiInfoTableFixed"><tbody><tr><th class="label">Networks</th><td class="data"><div class="uiCollapsedList uiCollapsedListHidden" id="up82eq_32"><span class="visible">XXXX</span></div></td></tr></tbody></table></div></div> --></code>

It's basically some field I am interested in. What surprises me is that I can get some fields, but not all.
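
The snippet above is an HTML comment wrapped in a hidden code element, which suggests that a script in the page later moves it into the visible document. If blocks like this do appear in the fetched data, a sketch along these lines (Python 3, standard library only; the class and function names are illustrative) pulls the commented-out markup out of the raw source so it can be parsed separately.

```python
from html.parser import HTMLParser


class HiddenCommentCollector(HTMLParser):
    """Collect HTML comments found inside <code class="hidden_elem"> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # > 0 while inside a matching <code> element
        self.payloads = []   # comment bodies, which are themselves HTML fragments

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Enter on a hidden <code>; also count nested <code> tags so the
        # matching end tags are paired correctly.
        if tag == "code" and ("hidden_elem" in classes or self._depth):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self._depth:
            self._depth -= 1

    def handle_comment(self, data):
        if self._depth:
            self.payloads.append(data.strip())


def hidden_payloads(source):
    """Return the commented-out fragments hidden inside the page source."""
    parser = HiddenCommentCollector()
    parser.feed(source)
    parser.close()
    return parser.payloads
```

Each returned fragment is ordinary HTML and can be fed to whatever parser is already used for the rest of the page. If hidden_payloads returns nothing for the fields in question, the markup was not in the response at all.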

javascript is loading some content maybe and your crawler is not executing it? — Osama Javed, Jul 24 '12 at 10:13
@dyoser I did check the code charset, thanks for the suggestion, but unfortunately it's not the cause of my problem. — S4M, Jul 24 '12 at 10:52
possible duplicate of [Python Selenium accessing HTML source](http://stackoverflow.com/questions/7861775/python-selenium-accessing-html-source) — durron597, Sep 03 '15 at 15:49

Stan · Answer 1 · 2012-07-24T10:53:20.950

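The page may run JavaScript that generates some of its content.
Try Twill.
It is based on Mechanize. Note that, as far as I know, Twill does not execute JavaScript itself, so it only helps when the content is already present in the raw HTML.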
Sample in Python:

from twill.commands import *
go("http://google.com/")
fv("f", "q", "test")
submit("btnG")
info() #shows page info
show() #shows html

Another option is to use Zombie.js on Node.js.
This library works even better than Twill, and it is a browserless solution.
Sample in Coffeescript:

zombie = require "zombie"
browser = new zombie()
browser.visit "https://www.google.ru/", =>
    browser.fill "q", "node.js"
    browser.pressButton "Поиск в Google", ->
        for item in browser.queryAll "h3.r a"
            console.log item.innerHTML

I just tried Twill, but it's not working either. In fact, I have even more info missing than with urllib2 – S4M, Jul 24 '12 at 10:29

score 2 · Accepted Answer · answered Jul 24 '12 at 10:21

Facebook is heavily JavaScript-oriented. The page source you see in the browser is the DOM after any JS code has run (and it will frequently be changing anyway). You may have to automate a browser (using Selenium), or try other tools such as mechanize... Or look into a proper FB app and use the FB API.
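
A rough sketch of the browser-automation route, assuming the selenium package plus Firefox and geckodriver are installed. The function name rendered_source and the driver_factory parameter are illustrative; the Selenium calls used are webdriver.Firefox, get, page_source and quit.

```python
def rendered_source(url, driver_factory=None):
    """Load url in a real browser and return the DOM serialized after scripts have run."""
    if driver_factory is None:
        # Imported lazily so the helper can also be exercised with a stand-in driver.
        from selenium import webdriver
        driver_factory = webdriver.Firefox  # assumption: Firefox and geckodriver are available
    driver = driver_factory()
    try:
        driver.get(url)            # returns once the page's load event has fired
        return driver.page_source  # current DOM as HTML, not the original response
    finally:
        driver.quit()              # always close the browser, even on errors
```

Content that only appears after a click, a scroll, or a late asynchronous request will still be missing at this point; in that case the script has to trigger the event or wait for the element before reading page_source.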

Jon Clements


I just tried with mechanize like that: resp = mechanize.urlopen(url); txt = resp.read() But still the same problem... – S4M Jul 24 '12 at 10:39
@S4M May well just have to use Selenium then http://seleniumhq.org/ - Bear in mind data may not exist until certain user events occur (such as clicking items, or expanding menus) – Jon Clements Jul 24 '12 at 10:42
