
I am trying to scrape all the blog links on this page: http://hypem.com/track/26ed4/Skizzy+Mars+-+Way+I+Live

You click "more" to reveal the links, but only one link is visible in the HTML source. I am using BeautifulSoup; how would I get the other links?

Andrew
  • Either BS + Mechanize OR Scrapy. – shaktimaan Sep 09 '14 at 04:19
  • Please show what you have tried so far. – alecxe Sep 09 '14 at 04:22
  • I have tried BeautifulSoup. However, the HTML source doesn't display all the links, and BeautifulSoup can only parse the HTML, so I am at a loss as to what to do. If you tell me other techniques to try, I will try them and report back. – Andrew Sep 09 '14 at 04:27
  • This answer looks promising; I will investigate: http://stackoverflow.com/questions/17597424/how-to-retrieve-the-values-of-dynamic-html-content-using-python – Andrew Sep 09 '14 at 04:29
  • What is your desired output? – alecxe Sep 09 '14 at 04:39
  • I want the links to the blogs that appear when I click "more". Using the method in the answer I mentioned, I can get the rendered HTML, but I don't know how to trigger the "more" button programmatically (see the sketch below). Once I can do that I'm fine, as I can just use regex/BeautifulSoup. – Andrew Sep 09 '14 at 04:47
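
For reference, triggering the "more" button from the linked answer's browser-automation route could look like the minimal Selenium sketch below. The CSS selector for the button and the fixed sleep are assumptions to verify against the live page, not tested code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get('http://hypem.com/track/26ed4/Skizzy+Mars+-+Way+I+Live')

# hypothetical selector for the "more" button -- inspect the page to find the real one
driver.find_element(By.CSS_SELECTOR, 'a.more').click()
time.sleep(2)  # crude wait for the extra links to load; WebDriverWait is more robust

# hand the rendered HTML to BeautifulSoup and pull out every link
soup = BeautifulSoup(driver.page_source, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))

driver.quit()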

1 Answer


You can stay with the requests + BeautifulSoup approach. You just need to simulate the underlying requests that go to the server when you click the More blogs button and scroll down the page.

Here's the code that prints all of the blog post image titles from the http://hypem.com/blogs page:

from bs4 import BeautifulSoup
import requests


def extract_blogs(content):
    # parse the returned HTML and print the title of every blog image in the directory
    soup = BeautifulSoup(content, 'html.parser')
    for link in soup.select('div.directory-blog img'):
        print(link.get('title'))

# extract blogs from the main page
response = requests.get('http://hypem.com/blogs')
extract_blogs(response.content)

# paginate over the remaining results until the server returns an empty response
page = 2
url = 'http://hypem.com/inc/serve_sites.php?featured=true&page={page}'

while True:
    response = requests.get(url.format(page=page))
    if not response.content.strip():
        break
    extract_blogs(response.content)
    page += 1

Prints:

Heart and Soul
Avant-Avant
Different Kitchen
Ladywood 
Orange Peel
Phonographe Corp
...
Stadiums & Shrines
Caipirinha Lounge
Gorilla Vs. Bear
ISO50 Blog
Fluxblog
Music ( for robots)

Hope this gives you at least a basic idea of how to scrape the web page contents in this case.
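
If you need the blog URLs rather than the image titles, a small variant of extract_blogs could pull the href attributes instead. The 'div.directory-blog a' selector here is an assumption about the markup; verify it in the browser's DevTools before relying on it:

from bs4 import BeautifulSoup

def extract_blog_links(content):
    # same parsing step, but collect anchor hrefs instead of image titles
    soup = BeautifulSoup(content, 'html.parser')
    for link in soup.select('div.directory-blog a'):
        href = link.get('href')
        if href:
            print(href)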

alecxe
  • Thank you for your detailed answer. I understand everything except how you chose that particular URL. What is the logic behind that page, and how did you know to use that one? – Andrew Sep 09 '14 at 05:38
  • @Andrew I used the browser developer tools (Chrome, in my case): in the Network tab you can inspect the browser-server requests as they happen. – alecxe Sep 09 '14 at 11:36