How to retrieve the complete content of a web page with Python

Question

Some web pages don't show their full content when first loaded; they display only part of it to reduce loading time.

As the user drags the scroll bar down, more and more content is displayed.

My question is: how can I get the complete content of such a web page with Python?

At first I tried

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen
content = urlopen('http://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android/backers')

but it only gets the first part of the page.

Thanks.

This is really too broad a question. Different sites use different techniques to create dynamic content. We can fill a book with the subject. — Martijn Pieters, Apr 13 '13 at 15:00
@MartijnPieters Thanks for your comment. I had thought there is a generic method to do this. Sorry for the "too broad a question". I've added the specific URL in my code. Thanks again. — Landy, Apr 13 '13 at 15:37

score 0 · Accepted Answer · edited May 23 '17 at 11:43

As Martijn Pieters points out, websites accomplish this in many different ways. Because of this, you might want to use a headless browser, which runs the page's JavaScript for you. Here is a link to a question where this is discussed:

Headless Browser for Python (Javascript support REQUIRED!)

In that question, Richard gives the following answer, which you might find useful:

I use webkit as a headless browser in Python via pyqt / pyside: http://www.riverbankcomputing.co.uk/software/pyqt/download http://developer.qt.nokia.com/wiki/Category:LanguageBindings::PySide::Downloads

I particularly like webkit because it is simple to set up. For Ubuntu you just use:

sudo apt-get install python-qt4

Here is an example script: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

I hope this helps.

P.S.: For future questions, try to be a bit more specific, so you don't get down-voted by others.

Edit: 2013-04-13 19:00 CAT

After looking at your updated question, with the specific URL you are investigating, I opened it in Chrome and inspected the network requests with the Developer Tools. When you reach the bottom of the page, it requests a URL of the following format:

http://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android/backers?cursor=675683697

You just need to extract the proper cursor value for the next request from the HTML of the previous response.
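
Here is a rough sketch of that loop, in case it helps. It is not tested against the live site. The `data-cursor` attribute in `CURSOR_PATTERN` is purely an assumption for illustration; you will have to inspect the returned HTML to see where the cursor value really lives and adjust the pattern. The names `last_cursor`, `fetch_all_pages` and `fetch_with_urlopen` are my own, and the code assumes Python 3.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

BACKERS_URL = ("http://www.kickstarter.com/projects/597507018/"
               "pebble-e-paper-watch-for-iphone-and-android/backers")

# ASSUMPTION: each chunk of HTML exposes cursor values in an attribute
# like data-cursor="675683697". Check the real markup and change this.
CURSOR_PATTERN = re.compile(r'data-cursor="(\d+)"')


def last_cursor(html):
    """Return the last cursor value found in html, or None if there is none."""
    matches = CURSOR_PATTERN.findall(html)
    return matches[-1] if matches else None


def fetch_all_pages(fetch, base_url=BACKERS_URL, max_pages=1000):
    """Follow the cursor from one response to the next and collect the HTML.

    fetch is any callable that takes a URL and returns the body as a str.
    """
    pages = []
    seen = set()  # guards against a cursor that repeats forever
    url = base_url
    for _ in range(max_pages):
        html = fetch(url)
        pages.append(html)
        cursor = last_cursor(html)
        if cursor is None or cursor in seen:
            break  # no further content to request
        seen.add(cursor)
        url = base_url + "?" + urlencode({"cursor": cursor})
    return pages


def fetch_with_urlopen(url):
    """Download url and decode the body as UTF-8 (encoding is assumed)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_all_pages(fetch_with_urlopen)` would then return a list with one HTML string per request. Passing the fetch function in as an argument lets you try the loop on saved HTML before hitting the real server.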

answered Apr 13 '13 at 15:07

ralfe


Thanks for your answer. I've revised my question with the specific url. – Landy Apr 13 '13 at 15:38
And I really learnt a lot from your post. Thanks again. – Landy Apr 13 '13 at 15:39
I'm glad. If you appreciate an answer, it is always nice to upvote it by clicking on the up arrow to the left of the answer. – ralfe Apr 13 '13 at 16:47
Thank you, @ralfe. I don't have enough reputation to click the 'up' arrow, but I've accepted your answer. Appreciate your help, indeed. – Landy Apr 13 '13 at 17:37
Thanks @Landy, did you take a look at my Edit? Did that make sense to you? – ralfe Apr 13 '13 at 18:52
Yes, I saw your edit. The key point is to find which request is sent to the server to fetch more content. You showed me a good example, thanks a lot, and sorry for the ambiguous original question. You're so nice. – Landy Apr 15 '13 at 14:20
