
I have been using the following code to parse the web page at https://www.blogforacure.com/members.php. The code is expected to return the links of all the members on the given page.

    from bs4 import BeautifulSoup
    import urllib
    r = urllib.urlopen('https://www.blogforacure.com/members.php').read()
    soup = BeautifulSoup(r,'lxml')
    headers = soup.find_all('h3')
    print(len(headers))
    for header in headers:
        a = header.find('a')
        print(a.attrs['href'])

But I get only the first 10 links from the above page. Even when I print the output of soup.prettify() I see only the first 10 links.

  • The results are loaded through AJAX calls. When you reach the end of the page, new results are fetched from the server. – neetesh Jul 21 '16 at 11:02
  • How can I deal with that? – athira Jul 21 '16 at 11:07
  • My approach is to use Selenium to interface with the page and scroll to the bottom, as described in: http://stackoverflow.com/questions/25870906/scrolling-web-page-using-selenium-python-webdriver (a rough sketch of that idea follows below). –  Jul 21 '16 at 17:38
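As a rough illustration of the Selenium idea from the last comment (my own sketch, not code from the linked answer, and it assumes chromedriver is installed and on the PATH): keep scrolling until the page height stops growing, then parse the fully loaded page the same way as in the question.

    from bs4 import BeautifulSoup
    from selenium import webdriver
    import time

    driver = webdriver.Chrome()  # assumes chromedriver is available on the PATH
    driver.get('https://www.blogforacure.com/members.php')

    # scroll to the bottom repeatedly until the page height stops growing,
    # i.e. the AJAX scroller no longer appends new members
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # crude wait for the AJAX call to append new entries
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # the full member list is now in the DOM; parse it as before
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for header in soup.find_all('h3'):
        a = header.find('a')
        if a is not None:
            print(a.attrs['href'])

    driver.quit()

The fixed time.sleep(2) is a simplification; an explicit wait for new member entries would be more robust.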

1 Answer

The results are dynamically loaded by making AJAX requests to the https://www.blogforacure.com/site/ajax/scrollergetentries.php endpoint.

Simulate them in your code with requests, maintaining a web-scraping session:

    from bs4 import BeautifulSoup
    import requests

    url = "https://www.blogforacure.com/site/ajax/scrollergetentries.php"
    with requests.Session() as session:
        session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
        session.get("https://www.blogforacure.com/members.php")

        page = 0
        members = []
        while True:
            # get page
            response = session.post(url, data={
                "p": str(page),
                "id": "#scrollbox1"
            })
            html = response.json()['html']

            # parse html
            soup = BeautifulSoup(html, "html.parser")
            page_members = [member.get_text() for member in soup.select(".memberentry h3 a")]
            print(page, page_members)
            members.extend(page_members)

            page += 1

It prints the current page number and the list of members on each page, accumulating the member names into the members list. I'm not posting what it prints since it contains real names.

Note that I've intentionally left the loop endless; please figure out the exit condition yourself. It may be when response.json() throws an error.
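For instance, here is a minimal sketch of one possible exit condition (an assumption on my part; the exact end-of-data behaviour of the endpoint isn't confirmed): wrap the page fetch in a helper that returns None once the response is no longer usable, and break out of the loop when that happens.

    def fetch_member_page(session, url, page):
        """Return the 'html' fragment for one scroller page, or None when exhausted."""
        # Sketch only: assumes the endpoint either stops returning valid JSON
        # or returns an empty 'html' fragment once all members have been listed.
        response = session.post(url, data={"p": str(page), "id": "#scrollbox1"})
        try:
            html = response.json()['html']
        except (ValueError, KeyError):  # no longer JSON, or no 'html' key
            return None
        return html if html.strip() else None

The while True loop would then break as soon as fetch_member_page() returns None instead of an html fragment.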

  • I am new to this and I have a very basic question: how did you get to know about site/ajax/scrollergetentries.php? How can I find that for another page? And can you explain session.post() to me? – athira Jul 22 '16 at 07:05
  • @athira I used the browser developer tools (the network tab): when the page was loaded I scrolled down and saw multiple requests to the `scrollergetentries.php` endpoint. Hope that helps. – alecxe Jul 22 '16 at 12:11