2

I have a URL:

http://www.goudengids.be/qn/business/advanced/where/Provincie%20Antwerpen/what/restaurant

On that page there is a "next results" button that loads another 20 data points below the first set without updating the URL. I wrote a Python script to scrape this page, but it only scrapes the first 22 data points, even after the "next results" button has been clicked and the page shows about 40 entries.

How can I scrape websites like this that load content dynamically?

My script is:

import csv
import requests
from bs4 import BeautifulSoup


url = "http://www.goudengids.be/qn/business/advanced/where/Provincie%20Antwerpen/what/restaurant/"
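r = requests.get(url)

# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())

# each listing name sits in an <a class="heading"> element
g_data2 = soup.find_all("a", {"class": "heading"})
for item in g_data2:
    name = item.text.strip()
    if name:
        print(name)
    else:
        print("No Name found!")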
Prateek Gupta
vishnu

2 Answers

2

If you want to solve it with requests, you need to mimic what the browser does when you click the "Load More" button: it sends an XHR request to the http://www.goudengids.be/q/ajax/business/results.json endpoint. Simulate that request in your code while maintaining the web-scraping session. The XHR responses are in JSON format, so there is no need for BeautifulSoup in this case at all:

import requests

main_url = "http://www.goudengids.be/qn/business/advanced/where/Provincie%20Antwerpen/what/restaurant/"
xhr_url = "http://www.goudengids.be/q/ajax/business/results.json"
with requests.Session() as session:
    session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})

    # visit main URL
    session.get(main_url)

    # load more listings - follow the pagination
    page = 1
    listings = []
    while True:
        params = {
            "input": "restaurant Provincie Antwerpen",
            "what": "restaurant",
            "where": "Provincie Antwerpen",
            "type": "DOUBLE",
            "resultlisttype": "A_AND_B",
            "page": str(page),
            "offset": "2",
            "excludelistingids": "nl_BE_YP_FREE_11336647_0000_1746702_6165_20130000, nl_BE_YP_PAID_11336647_0000_1746702_7575_20139729427, nl_BE_YP_PAID_720348_0000_187688_7575_20139392980",
            "context": "SRP * A_LIST"
        }
        # send the XHR through the session so the cookies from the first visit are reused
        response = session.get(xhr_url, params=params, headers={
            "X-Requested-With": "XMLHttpRequest",
            "Referer": main_url
        })
        data = response.json()

        # collect listing names in a list (for example purposes)
        listings.extend([item["bn"] for item in data["overallResult"]["searchResults"]])

        page += 1

        # TODO: figure out exit condition for the while True loop

    print(listings)

I've left an important TODO for you: figure out the exit condition, that is, when to stop collecting listings.
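One possible way to approach that TODO, as a rough sketch rather than a tested solution for this site: stop as soon as a page comes back with no results, and cap the number of pages as a safety net. That an empty `searchResults` list marks the end is an assumption. `collect_listings`, `fetch_page`, and `max_pages` are hypothetical names, and the fake data stands in for the real endpoint.

```python
def collect_listings(fetch_page, max_pages=100):
    """Collect listing names page by page until the results run out.

    fetch_page is a hypothetical callable: it takes a page number and
    returns the parsed JSON (a dict) for that page.
    """
    listings = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        results = data.get("overallResult", {}).get("searchResults", [])
        if not results:
            # assumption: an empty page means there is nothing left to load
            break
        listings.extend(item["bn"] for item in results)
    return listings


# Stand-in for the real endpoint: two pages of made-up data, then nothing.
fake_pages = {
    1: {"overallResult": {"searchResults": [{"bn": "A"}, {"bn": "B"}]}},
    2: {"overallResult": {"searchResults": [{"bn": "C"}]}},
}


def fake_fetch(page):
    return fake_pages.get(page, {"overallResult": {"searchResults": []}})


names = collect_listings(fake_fetch)
print(names)
```

In the real script, `fetch_page` would wrap the GET request to `xhr_url` for the given page and return `response.json()`. The `max_pages` cap keeps the loop finite even if the endpoint never returns an empty page.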

Graham
alecxe
  • When I ran your script, it gave me an error message: Traceback (most recent call last): File "C:\Users\User\Desktop\python\script\3url.py", line 3, in <module> with requests.Session() as session: NameError: name 'requests' is not defined. How can I fix it? – vishnu Jul 27 '16 at 09:00
  • @vishnu See the `import requests` line at the top? It is important. You also need to have the `requests` module installed. – alecxe Jul 27 '16 at 12:59
  • Yes, you are correct @alecxe, I forgot it. Thank you for your help; I may need it again in the future. – vishnu Jul 28 '16 at 15:34
  • @alecxe I have another URL, http://www.theknowledgeonline.com/production-companies, from which I need to scrape the name, address, phone number, email, etc. The data is not available directly on that page: I have to click each link, which opens a new page containing the full data. How can I scrape this type of URL? – vishnu Jul 28 '16 at 17:30
1

Instead of focusing on scraping HTML, I think you should look at the JSON that is retrieved via AJAX. The JSON is less likely to change in the future than the page's markup. On top of that, it's much easier to traverse a JSON structure than it is to scrape a DOM.
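To give a feel for how little code the JSON route needs, here is a minimal sketch. The sample payload is made up. The key names (`overallResult`, `searchResults`, `bn`) are borrowed from the other answer's code and are assumptions about the live response. `listing_names` is a hypothetical helper.

```python
import json

# Made-up sample, shaped like the response described in the other answer.
sample = """
{"overallResult": {"searchResults": [
    {"bn": "Restaurant A"},
    {"bn": "Restaurant B"}
]}}
"""


def listing_names(payload):
    """Return the business names from a parsed results payload."""
    results = payload.get("overallResult", {}).get("searchResults", [])
    # skip entries without a name instead of raising KeyError
    return [item["bn"] for item in results if "bn" in item]


data = json.loads(sample)
names = listing_names(data)
print(names)
```

Using `.get()` with defaults means a missing key yields an empty list rather than an exception, which is harder to achieve when walking a DOM.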

For instance, when you load the page you provided, it requests JSON from http://www.goudengids.be/q/ajax/business/results.json.

It passes URL parameters to that endpoint to query the businesses. I think you should look into using this to get your data instead of scraping the page, simulating button clicks, and so on.
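As a small illustration of what such a query looks like, the sketch below builds the request URL with the standard library. The parameter names (`what`, `where`, `page`) are assumptions taken from the other answer's request; the real endpoint may require more of them.

```python
from urllib.parse import urlencode

xhr_url = "http://www.goudengids.be/q/ajax/business/results.json"

# Assumed parameter names, borrowed from the other answer's request.
params = {"what": "restaurant", "where": "Provincie Antwerpen", "page": 2}

# urlencode escapes the values (a space becomes "+") and joins them with "&"
query_url = xhr_url + "?" + urlencode(params)
print(query_url)
```

Changing `page` in `params` is then enough to ask for the next batch of results.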

Edit:

It also looks like the endpoint relies on the cookies set when you first visit the site to check that you have a valid session. So you may have to request the site once to obtain those cookies and send them with subsequent requests to get the JSON from the endpoint above. I still think this will be easier and more predictable than trying to scrape HTML.
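A minimal sketch of that two-step flow, assuming a `requests.Session`-like object that stores cookies from one request and resends them on the next. `fetch_results_json` is a hypothetical helper, and whether cookies alone are enough for this site is an assumption.

```python
def fetch_results_json(session, main_url, xhr_url, params):
    """Visit the page first, then call the JSON endpoint on the same session."""
    # first request: lets the server set its cookies on the session
    session.get(main_url)
    # second request: the session resends those cookies automatically
    response = session.get(xhr_url, params=params)
    return response.json()


# Usage with the third-party requests package (not run here):
#   import requests
#   with requests.Session() as session:
#       data = fetch_results_json(session, main_url, xhr_url, {"what": "restaurant"})
```

Passing the session in as an argument keeps the cookie handling in one place and makes the helper easy to try out with a stand-in object.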

arjabbar