
I'm trying to create a program that will go through a bunch of tumblr photos and extract the username of the person who uploaded them.
http://www.tumblr.com/tagged/food

If you look here, you can see multiple pictures of food from multiple different uploaders. If you scroll down, even more pictures from even more uploaders appear. If you right click in your browser to view the source and search for "username", however, it will only yield 10 results, every time, no matter how far down you scroll.

Is there any way to counter this and instead have it display the entire source for all images, or for X amount of images, or for however far you scrolled?

Here is my code to show what I'm doing:

#Imports
import requests
from bs4 import BeautifulSoup

#Start of code
r = requests.get('http://www.tumblr.com/tagged/skateboard')
page = r.content

#Pass a parser explicitly so BeautifulSoup doesn't guess one
soup = BeautifulSoup(page, 'html.parser')
arrayDiv = []

for anchor in soup.findAll("div", {"class": "post_info"}):
    #Strip the surrounding markup to leave just the username text
    anchor = str(anchor)
    tempString = anchor.replace('</a>:', '')
    tempString = tempString.replace('<div class="post_info">', '')
    tempString = tempString.replace('</div>', '')
    tempString = tempString.split('>')
    newString = tempString[1].strip()

    arrayDiv.append(newString)

print(arrayDiv)
  • On the link you give there are only 10 items displayed... so you should expect the source to have 10 items too. You need to click "Next" to see the next 10 items... I can't scroll more than 10. Maybe you need to be signed-in to be able to scroll more? – JScoobyCed Mar 16 '12 at 10:16
  • @JScoobyCed What browser are you using? On chrome, you scroll down and it auto repopulates with the next amount of updates. That's why I'm not sure of how to get the source for the next amount of items. ideas? – Anteara Mar 16 '12 at 10:17
  • You can't, without executing the scripts on the page, which grabbing the page with Requests and parsing the HTML isn't going to do. (possible duplicate of [Scraping websites with Javascript enabled?](http://stackoverflow.com/questions/3362859/scraping-websites-with-javascript-enabled)) – Wooble Mar 16 '12 at 10:21
  • I was using Firefox 11. Now trying in Chrome 17 with same result: only 10 items are shown. I agree with Wooble that if the page is using an "Infinite Scroll" kind of update, then you need to get the AJAX request call and loop in your code to fetch each parts – JScoobyCed Mar 16 '12 at 10:32
  • @JScoobyCed Alright then, I've read part of that "scraping websites with Javascript enabled" question, and it links to someone using pywebkitgtk. Would this achieve what I would like to do? I don't have time at the moment to read it, so I apologize if that question is daft. If not, do you know if requests can make an AJAX request call? – Anteara Mar 16 '12 at 10:41
  • no need for javascript scraping.. you can do the same with python from the server side.. explanation in my answer below – alonisser Mar 16 '12 at 11:23
  • btw, you almost certainly want to use Tumblr's API. (http://www.tumblr.com/docs/en/api/v2). I can't be bothered to read the terms of service for a site I don't use, but don't be shocked if they block your IP for scraping. – Wooble Mar 16 '12 at 13:42
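As the last comment suggests, Tumblr's v2 API is usually a better route than scraping the rendered HTML. A minimal sketch of querying the `/v2/tagged` endpoint with requests; the `api_key` value is a placeholder you would get by registering an app with Tumblr, and the JSON layout follows the v2 docs linked above:

```python
import requests

def blog_names(payload):
    """Extract uploader names from a decoded /v2/tagged JSON payload."""
    return [post.get("blog_name") for post in payload.get("response", [])]

def tagged_blog_names(tag, api_key):
    # api_key is a placeholder; register an app with Tumblr to obtain one
    resp = requests.get(
        "http://api.tumblr.com/v2/tagged",
        params={"tag": tag, "api_key": api_key},
    )
    return blog_names(resp.json())
```

The API returns structured JSON per post, so there is no HTML to parse and no infinite-scroll script to fight.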

1 Answer


I solved a similar problem using BeautifulSoup. What I did is loop through the paged pages: check with BeautifulSoup whether there is a "continue" element. Here (on the Tumblr page), for example, this is an element with the id "next_page_link". If there is one, loop the photo-scraping code while changing the URL fetched by requests. You would need to encapsulate all the code in a function, of course.

good luck.
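The loop described above can be sketched roughly as follows. The `post_info` class and the `next_page_link` id come from the question's page; the rest of the structure (relative `href` on the next link, username inside the first anchor) is an assumption about Tumblr's markup at the time:

```python
import requests
from bs4 import BeautifulSoup

def extract_usernames(html):
    """Pull the uploader name out of each post_info div on one page."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for info in soup.find_all("div", {"class": "post_info"}):
        link = info.find("a")
        if link:
            names.append(link.get_text(strip=True))
    return names

def next_page_url(html, base="http://www.tumblr.com"):
    """Return the URL behind the #next_page_link element, or None."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find(id="next_page_link")
    if link is not None and link.has_attr("href"):
        return base + link["href"]
    return None

def scrape(start_url, limit=200):
    """Follow "next" links until we have `limit` usernames or run out of pages."""
    names, url = [], start_url
    while url and len(names) < limit:
        html = requests.get(url).content
        names.extend(extract_usernames(html))
        url = next_page_url(html)
    return names[:limit]
```

Splitting the parsing out of the fetch loop also makes each piece easy to test against a saved page.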

  • Alright, so you're essentially saying this: I added this as a comment in my script so I don't forget. So, is this what you mean? `#Put the code in a function. Look through the tumblr.com/tagged/food source code. Find "next_page_link". These are unique to each page so I should be able to use it for each time. I want around 200, so I can loop through it around 20 times to have 200 elements in my array.` Yeah? – Anteara Mar 16 '12 at 11:42
  • Update, using what you suggested above I was able to create an array and fill it with 200 test names. Thanks so much. – Anteara Mar 16 '12 at 12:15