How to scrape data by triggering 'Read more' button

Question

I am trying to scrape the reviews from https://www.mouthshut.com/product-reviews/ICICI-Lombard-Auto-Insurance-reviews-925641018 using BeautifulSoup in Python.

Each review is truncated and ends with a "Read more..." button. How can I trigger that button to fetch the whole content?

I found that an XHR request is fired when I click the button. How can I simulate that using Python?

Also, after inspecting the "Read more..." button I got this:

<a style="cursor:pointer" onclick="bindreviewcontent('2836986',this,false,'I found this review of ICICI Lombard Auto Insurance pretty useful',925641018,'.jpg','I found this review of ICICI Lombard Auto Insurance pretty useful %23WriteShareWin','https://www.mouthshut.com/review/ICICI-Lombard-Auto-Insurance-review-rmlrrturotn','ICICI Lombard Auto Insurance',' 1/5','rmlrrturotn');">Read More</a>

How can I trigger the onclick event using Python?

Can you show what you have tried so far? — mnm, Jun 16 '19 at 11:42

score 2 · Answer 1 · answered Jun 16 '19 at 13:18

Extracting all reviews with ratings and links

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd


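def add_reviews(s, soup, results):
    """Append [title, link, rating, full review text] for each review on the page."""
    for review in soup.select('.review-article'):
        info = review.select_one('a')
        title = info.text
        link = info['href']
        # The number of .rated-star elements is taken as the star rating.
        rating = len(review.select('.rated-star'))
        # Copy the module-level payload template instead of mutating it.
        payload = dict(data, reviewid=review.select_one('[reviewid]')['reviewid'])
        # Same POST that the "Read More" button fires; the session supplies the cookies.
        r = s.post('https://www.mouthshut.com/review/CorporateResponse.ashx', data=payload)
        full_soup = bs(r.content, 'lxml')
        full_text = ' '.join(p.text for p in full_soup.select('p'))
        results.append([title, link, rating, full_text])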

url = 'https://www.mouthshut.com/product-reviews/ICICI-Lombard-Auto-Insurance-reviews-925641018-page-{}'
data = {'type': 'review', 'reviewid': '', 'catid': '925641018', 'corp': 'false', 'catname': ''}
results = []

with requests.Session() as s:
    r = s.get('https://www.mouthshut.com/product-reviews/ICICI-Lombard-Auto-Insurance-reviews-925641018')
    soup = bs(r.content, 'lxml')
    pages = int(soup.select('#spnPaging .btn-link')[-1].text)
    add_reviews(s, soup, results)
    # range() is empty when there is only one page, so no extra guard is needed.
    for page in range(2, pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        add_reviews(s, soup, results)

df = pd.DataFrame(results, columns=['Title', 'Link', 'Rating', 'Review'])
print(df)

I was trying to use the same code to extract data from the Flipkart site: https://www.flipkart.com/samsung-253-l-frost-free-double-door-3-star-convertible-refrigerator/product-reviews/itmf75fa1554bad3?pid=RFRFNDEEJ28SNQPG&lid=LSTRFRFNDEEJ28SNQPGEJ3YHJ&sortOrder=MOST_HELPFUL&certifiedBuyer=false&aid=overall&page=1 — user3415910, May 07 '21 at 15:03
Could you help me extract the data behind the "Read More" links on that site? — user3415910, May 07 '21 at 15:05
I used Selenium to get this done. I have attached a link with more details. — user3415910, May 10 '21 at 10:39

score 1 · Answer 2 · answered Jun 16 '19 at 11:52

There are two ways you can go about this. One way is to use Selenium, which lets you control a browser programmatically (most common browsers, like Firefox and Chrome, are supported). I am not familiar with it, and it might be overkill in many situations (I imagine running a browser incurs some overhead), but it's good to know.

Another way is to do some more inspection to see what's going on when you click the "Read More" button. The "Network" tab in the developer tools (I am using Chrome, but I think Firefox also has the same thing) can help with that by showing you all the HTTP requests the browser is sending.

I find that when you click the "Read More" button, a POST request is sent to https://www.mouthshut.com/review/CorporateResponse.ashx with the following data:

type: review
reviewid: 2836986
corp: false
isvideo: false
fbmessage: I found this review of ICICI Lombard Auto Insurance pretty useful
catid: 925641018
prodimg: .jpg
twittermsg: I found this review of ICICI Lombard Auto Insurance pretty useful %23WriteShareWin
twitterlnk: https://www.mouthshut.com/review/ICICI-Lombard-Auto-Insurance-review-rmlrrturotn
catname: ICICI Lombard Auto Insurance
rating_str:  1/5
usession: 0

However, when I simply sent a POST request with that data, it didn't work. That usually means that something in the HTTP headers matters. It is usually the cookie, and I have confirmed that this is indeed the case here. The solution is easy with the requests package (which you should totally use anyway): use requests.Session, which keeps the cookies from one request and sends them with the next.

Here is a proof of concept:

import requests

with requests.Session() as s:
    # Visit the listing page first so the session picks up the site's cookies.
    s.get('https://www.mouthshut.com/product-reviews/ICICI-Lombard-Auto-Insurance-reviews-925641018')
    payload = {'type': 'review', 'reviewid': '2836986', 'catid': '925641018', 'corp': 'false', 'catname': ''}
    r = s.post('https://www.mouthshut.com/review/CorporateResponse.ashx', data=payload)
    print(r.text)

The result is some HTML containing what you are looking for. Enjoy souping!
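To avoid hard-coding reviewid and catid, one option is to read them out of each button's onclick attribute. The following is only a minimal sketch: it assumes the arguments of bindreviewcontent always come in the order shown in the question (reviewid first, catid fifth), which I have not verified beyond that one example. parse_onclick_args and build_payload are hypothetical helper names, not part of any library.

```python
import re

# The onclick value copied from the "Read More" link in the question.
ONCLICK = (
    "bindreviewcontent('2836986',this,false,"
    "'I found this review of ICICI Lombard Auto Insurance pretty useful',"
    "925641018,'.jpg',"
    "'I found this review of ICICI Lombard Auto Insurance pretty useful %23WriteShareWin',"
    "'https://www.mouthshut.com/review/ICICI-Lombard-Auto-Insurance-review-rmlrrturotn',"
    "'ICICI Lombard Auto Insurance',' 1/5','rmlrrturotn');"
)

# One argument: either a single-quoted string or a bare token (this, false, 123).
ARG = re.compile(r"'((?:[^'\\]|\\.)*)'|([^,'\s][^,]*)")


def parse_onclick_args(onclick):
    """Return the arguments of the bindreviewcontent(...) call as strings."""
    call = re.search(r"bindreviewcontent\((.*)\)", onclick, re.S)
    if call is None:
        return []
    return [bare.strip() if bare else quoted
            for quoted, bare in ARG.findall(call.group(1))]


def build_payload(onclick):
    """Build the POST data; the argument positions are an assumption."""
    args = parse_onclick_args(onclick)
    return {'type': 'review', 'reviewid': args[0], 'catid': args[4],
            'corp': 'false', 'catname': ''}


print(build_payload(ONCLICK))
```

The resulting dict can be passed as data to s.post in the proof of concept above; the onclick string itself would come from the link's onclick attribute once the page has been parsed.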

score 0 · Answer 3 · answered May 10 '21 at 10:38

Some sites, like Flipkart, need a tool like Selenium to programmatically click the "Read More" links. Here is a link to such an implementation.
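For illustration, a rough sketch of the Selenium approach might look like the following. It is not the linked implementation. The XPath (an anchor whose text is "Read More"), the fixed sleep, and the helper names expand_read_more and fetch_expanded_html are all assumptions; a real site such as Flipkart will likely need a different locator and a proper wait. Selenium is imported inside the function so that the click helper can be tried without a browser installed.

```python
import time


def expand_read_more(driver, xpath="//a[normalize-space()='Read More']"):
    """Click every element matching xpath; return how many were clicked."""
    # "xpath" is the same value as selenium's By.XPATH constant.
    links = driver.find_elements("xpath", xpath)
    for link in links:
        # Clicking through JavaScript avoids failures when a link is off-screen.
        driver.execute_script("arguments[0].click();", link)
    return len(links)


def fetch_expanded_html(url, wait_seconds=2):
    """Open url in Chrome, expand the reviews, and return the page HTML."""
    from selenium import webdriver  # requires selenium and a Chrome install

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        expand_read_more(driver)
        # Crude wait for the XHR responses to arrive (assumption: 2 s is enough).
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML can then be handed to BeautifulSoup in the same way as in the other answers.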
