I am trying to scrape the reviews from https://www.mouthshut.com/product-reviews/ICICI-Lombard-Auto-Insurance-reviews-925641018 using BeautifulSoup in Python.

Each review is truncated and ends with a "Read more..." button. How can I trigger that button to fetch the full content?

I found that an XHR request is fired when I click the button. How can I simulate that request using Python?

Also, after inspecting the "Read more..." button I got this:

<a style="cursor:pointer" onclick="bindreviewcontent('2836986',this,false,'I found this review of ICICI Lombard Auto Insurance pretty useful',925641018,'.jpg','I found this review of ICICI Lombard Auto Insurance pretty useful %23WriteShareWin','https://www.mouthshut.com/review/ICICI-Lombard-Auto-Insurance-review-rmlrrturotn','ICICI Lombard Auto Insurance',' 1/5','rmlrrturotn');">Read More</a>

How can I trigger the onclick event using Python?

3 Answers

Extracting all reviews with ratings and links

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd


def add_reviews(s, soup, results):
    """Append [title, link, rating, full text] for each review on the page."""
    for review in soup.select('.review-article'):
        info = review.select_one('a')
        identifier = review.select_one('[reviewid]')['reviewid']
        # Copy the shared form data instead of mutating the module-level dict
        payload = dict(data, reviewid=identifier)
        title = info.text
        link = info['href']
        rating = len(review.select('.rated-star'))
        # Same POST the "Read More" button sends; the session supplies the cookies
        r = s.post('https://www.mouthshut.com/review/CorporateResponse.ashx', data=payload)
        soup2 = bs(r.content, 'lxml')
        full_text = ' '.join(p.text for p in soup2.select('p'))
        results.append([title, link, rating, full_text])

url = 'https://www.mouthshut.com/product-reviews/ICICI-Lombard-Auto-Insurance-reviews-925641018-page-{}'
data = {'type': 'review', 'reviewid': '', 'catid': '925641018', 'corp': 'false', 'catname': ''}
results = []

with requests.Session() as s:
    r = s.get('https://www.mouthshut.com/product-reviews/ICICI-Lombard-Auto-Insurance-reviews-925641018')
    soup = bs(r.content, 'lxml')
    pages = int(soup.select('#spnPaging .btn-link')[-1].text)
    add_reviews(s, soup, results)
    if pages > 1:
        for page in range(2, pages + 1):
            r = s.get(url.format(page))
            soup = bs(r.content, 'lxml')
            add_reviews(s, soup, results)

df = pd.DataFrame(results, columns=['Title', 'Link', 'Rating', 'Review'])
print(df)
QHarr

There are two ways you can go about this. One way is to use Selenium, which lets you control a browser programmatically (most common browsers, such as Firefox and Chrome, are supported). I am not familiar with it, and it might be overkill in many situations (I imagine running a browser incurs some overhead), but it's good to know about.

Another way is to do some more inspection to see what's going on when you click the "Read More" button. The "Network" tab in the developer tools (I am using Chrome, but I think Firefox also has the same thing) can help with that by showing you all the HTTP requests the browser is sending.

I find that when you click the "Read More" button, a POST request is sent to https://www.mouthshut.com/review/CorporateResponse.ashx with the following data:

type: review
reviewid: 2836986
corp: false
isvideo: false
fbmessage: I found this review of ICICI Lombard Auto Insurance pretty useful
catid: 925641018
prodimg: .jpg
twittermsg: I found this review of ICICI Lombard Auto Insurance pretty useful %23WriteShareWin
twitterlnk: https://www.mouthshut.com/review/ICICI-Lombard-Auto-Insurance-review-rmlrrturotn
catname: ICICI Lombard Auto Insurance
rating_str:  1/5
usession: 0

However, when I just sent a POST request with that data, it didn't work. That usually means that something in the HTTP headers matters. It is usually the cookie, and I have confirmed that this is indeed the case here. The solution is easy with the requests package (which you should totally use anyway): use requests.Session. A session keeps the cookies set by the initial GET of the review page and sends them along with the later POST.

Here is a proof of concept:

import requests
with requests.Session() as s:
    s.get('https://www.mouthshut.com/product-reviews/ICICI-Lombard-Auto-Insurance-reviews-925641018')
    print(s.post('https://www.mouthshut.com/review/CorporateResponse.ashx',
                 data = {'type': 'review', 'reviewid': '2836986', 'catid': '925641018', 'corp': 'false', 'catname': ''}
                ).text)

The result is some HTML containing what you are looking for. Enjoy souping!
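To repeat the request for every review on a page, the review ID has to be pulled out of each "Read More" link first. Here is a minimal sketch of one way to do that, assuming the onclick attribute appears in the raw page HTML with literal single quotes, as in the snippet from the question. The helper names extract_review_ids and build_payload are hypothetical, and the sample string is an abbreviated version of the real attribute:

```python
import re

# Captures the first quoted argument of bindreviewcontent(...), i.e. the review ID.
# Assumption: the ID is always a run of digits in single quotes.
ONCLICK_ID = re.compile(r"bindreviewcontent\(\s*'(\d+)'")


def extract_review_ids(html):
    """Return the review IDs found in onclick handlers, in page order."""
    return ONCLICK_ID.findall(html)


def build_payload(review_id, cat_id):
    """Build the minimal form data used in the proof of concept."""
    return {
        'type': 'review',
        'reviewid': str(review_id),
        'catid': str(cat_id),
        'corp': 'false',
        'catname': '',
    }


# Abbreviated sample of the "Read More" link (most arguments left out).
sample = (
    '<a style="cursor:pointer" '
    "onclick=\"bindreviewcontent('2836986',this,false,"
    "'I found this review of ICICI Lombard Auto Insurance pretty useful',"
    '925641018);">Read More</a>'
)

ids = extract_review_ids(sample)
payloads = [build_payload(review_id, '925641018') for review_id in ids]
```

Each payload can then be passed as data to s.post(...) inside the same session. If the site changes how it renders the link, the regular expression will need adjusting.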

Imperishable Night

Some sites, like Flipkart, need tools like Selenium to programmatically click the "Read more" links. Here is a link to such an implementation.

user3415910