
I am trying to scrape all the blog links on this page: http://hypem.com/track/26ed4/Skizzy+Mars+-+Way+I+Live

You click "more" to reveal the links, but only one link is visible in the HTML source. I am using BeautifulSoup; how would I get the other links?

Andrew
  • Either BS + Mechanize OR Scrapy. – shaktimaan Sep 09 '14 at 04:19
  • Please show what you have tried so far. – alecxe Sep 09 '14 at 04:22
  • I have tried BeautifulSoup. However, the HTML source doesn't display all the links, and BeautifulSoup can only parse the HTML, so I am at a loss as to what to do. If you tell me other techniques to try, I will try them and report back. – Andrew Sep 09 '14 at 04:27
  • This answer looks promising; I will investigate: http://stackoverflow.com/questions/17597424/how-to-retrieve-the-values-of-dynamic-html-content-using-python – Andrew Sep 09 '14 at 04:29
  • What is your desired output? – alecxe Sep 09 '14 at 04:39
  • I want the links to the blogs that appear when I click "more". Using the method in the answer I mentioned, I can get the rendered HTML, but I don't know how to trigger the "more" button programmatically (see the sketch below). Once I can do that I'm fine, as I can just use regex/BeautifulSoup. – Andrew Sep 09 '14 at 04:47
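
For reference, triggering the "more" button from the linked answer's browser-automation route could look like the minimal Selenium sketch below. The CSS selector for the button and the fixed sleep are assumptions to verify against the live page, not tested code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get('http://hypem.com/track/26ed4/Skizzy+Mars+-+Way+I+Live')

# hypothetical selector for the "more" button -- inspect the page to find the real one
driver.find_element(By.CSS_SELECTOR, 'a.more').click()
time.sleep(2)  # crude wait for the extra links to load; WebDriverWait is more robust

# hand the rendered HTML to BeautifulSoup and pull out every link
soup = BeautifulSoup(driver.page_source, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))

driver.quit()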

1 Answer


You can stay with the requests + BeautifulSoup approach. You just need to simulate the underlying requests that go to the server when you click the More blogs button and scroll down the page.

Here's the code that prints all of the blog post image titles from the http://hypem.com/blogs page:

from bs4 import BeautifulSoup
import requests


def extract_blogs(content):
    # parse the returned HTML and print the title of every blog image in the directory
    soup = BeautifulSoup(content, 'html.parser')
    for link in soup.select('div.directory-blog img'):
        print(link.get('title'))

# extract blogs from the main page
response = requests.get('http://hypem.com/blogs')
extract_blogs(response.content)

# paginate over the remaining results until the server returns an empty response
page = 2
url = 'http://hypem.com/inc/serve_sites.php?featured=true&page={page}'

while True:
    response = requests.get(url.format(page=page))
    if not response.content.strip():
        break
    extract_blogs(response.content)
    page += 1

Prints:

Heart and Soul
Avant-Avant
Different Kitchen
Ladywood 
Orange Peel
Phonographe Corp
...
Stadiums & Shrines
Caipirinha Lounge
Gorilla Vs. Bear
ISO50 Blog
Fluxblog
Music ( for robots)

Hope this gives you at least a basic idea of how to scrape the web page contents in this case.
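
If you need the blog URLs rather than the image titles, a small variant of extract_blogs could pull the href attributes instead. The 'div.directory-blog a' selector here is an assumption about the markup; verify it in the browser's DevTools before relying on it:

from bs4 import BeautifulSoup

def extract_blog_links(content):
    # same parsing step, but collect anchor hrefs instead of image titles
    soup = BeautifulSoup(content, 'html.parser')
    for link in soup.select('div.directory-blog a'):
        href = link.get('href')
        if href:
            print(href)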

alecxe
  • Thank you for your detailed answer. I understand everything except how you chose that particular URL. What is the logic behind that page, and how did you know to use that one? – Andrew Sep 09 '14 at 05:38
  • @Andrew I used the browser developer tools (Chrome, in my case): in the Network tab you can inspect the browser-server requests as they happen. – alecxe Sep 09 '14 at 11:36