
I am trying to scrape a website for a university project. The website is https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813. I have a problem with my Python code. I want to get all the reviews from pages 1 to 5, but instead I only get empty lists ([]). Any help would be appreciated!

Here is the code:

import csv
from bs4 import BeautifulSoup
import urllib.request
import re
import pandas as pd
import requests
reviewlist = []
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813')

soup = BeautifulSoup(response,'html.parser')

reviews = soup.find_all('div',{'class':'reviewContent'})


for i in reviews:
    review = {

        'per_review_name' : i.find('span',{'itemprop':'name'}).text.strip(),
        'per_review' : i.find('p',{'class':'reviewText'}).text.strip(),
        'per_review_taglia' : i.find('p',{'class':'singleReviewSizeDescr'}).text.strip(),
        
    }
    reviewlist.append(review)
   
for page in range (1,5):
    prova = soup.find_all('div',{'data-page': '{page}'})
    print(prova)
    print(len(reviewlist))
        
df = pd.DataFrame(reviewlist)
df.to_csv('list.csv',index=False)
print('Fine.')

And here is the output that I get:

[]
5
[]
5
[]
5
[]
5
Fine.

2 Answers


As I understand it, the site uses JavaScript to load most of its content, so you can't scrape that data from the initial HTML: it isn't there when the page first loads. You can, however, query the review backend directly. For your product page the link is:

https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date&page=1&rating=0&variant=0&size=0&bodyHeight=0&showOldReviews=true&xxl=false&variantFilters=

You can go through the pages by changing the page parameter in the URL of the GET request. The link returns an HTML document of the rating page, and you can get the rating from the rating value meta tag.
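To make the paging idea concrete, here is a minimal sketch that builds the URL for each page from the query parameters in the link above. The helper name build_review_url is made up for illustration, and I am assuming the other parameters can stay at the values shown in that link. It only builds the URLs and does not send any request.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.bonprix.it/reviews/list/"


def build_review_url(style_id, page):
    """Hypothetical helper: return the review-list URL for one page."""
    # Parameter names and values are copied from the link above;
    # only styleId and page are meant to change.
    params = {
        "styleId": style_id,
        "sortby": "date",
        "page": page,
        "rating": 0,
        "variant": 0,
        "size": 0,
        "bodyHeight": 0,
        "showOldReviews": "true",
        "xxl": "false",
        "variantFilters": "",
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Pages 1 to 5 (range excludes its upper bound, hence 6).
urls = [build_review_url(31436999, page) for page in range(1, 6)]
for url in urls:
    print(url)
```

Each of these URLs can then be fetched and parsed the same way as the product page.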

knpfl
  • Thank you very much for your help! Now my code works perfectly well. What makes me curious is how you got the link. My original link didn't contain the page number. Is there a way to do that in general? Thanks in advance. – Clara Puglisi Jun 26 '21 at 11:29
  • I think you can add the product ID as a parameter, or get the link from the Network tab of the browser's developer tools when using the XHR filter – knpfl Jun 27 '21 at 12:17
  • I tried it, but I don't know how to use the XHR filter (I checked Request URL, but nothing matches the URL you suggested). Any suggestion would be very helpful. – Clara Puglisi Jun 28 '21 at 15:22

The website only loads the first page of reviews in the first request. If you inspect its requests, you can see that it requests additional data when you change the review page. You can rewrite your code as follows to get the reviews from all pages:

import requests
from bs4 import BeautifulSoup

# First loop: download review pages 1 to 5 and collect the review <div> elements.
reviews_dom = []
for page in range(1, 6):
    url = f"https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date&page={page}&rating=0&variant=0&size=0&bodyHeight=0&showOldReviews=true&xxl=false&variantFilters="
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    reviews_dom += soup.find_all("div", attrs={"class": "reviewContent"})

# Second loop: extract the name, text and size description from each review.
reviews = []
for review_item in reviews_dom:
    review = {
        'per_review_name': review_item.find('span', attrs={'itemprop': 'name'}).text.strip(),
        'per_review': review_item.find('p', attrs={'class': 'reviewText'}).text.strip(),
        'per_review_taglia': review_item.find('p', attrs={'class': 'singleReviewSizeDescr'}).text.strip(),
    }
    reviews.append(review)

print(len(reviews))
print(reviews)

What happens in the code?

In the first loop, we request each page of reviews (the first 5 pages in the example above) and collect the review elements from the returned HTML.

In the second loop, we go through those review DOM elements and extract the data we need.
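Since the question also saves the result to a CSV file, here is a rough sketch of that last step using only the standard library instead of pandas. The function name reviews_to_csv is made up for illustration, and the sample row is placeholder data, not a real review. It assumes every review dict has the same three keys as in the code above.

```python
import csv
import io


def reviews_to_csv(reviews, out):
    """Hypothetical helper: write a list of review dicts to a file-like object."""
    # Column names match the dict keys built in the second loop.
    fieldnames = ["per_review_name", "per_review", "per_review_taglia"]
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(reviews)


# Placeholder row, only here to show the expected shape of the data.
sample = [
    {
        "per_review_name": "Example name",
        "per_review": "Example review text, with a comma",
        "per_review_taglia": "Example size note",
    },
]

# Write to an in-memory buffer so the sketch runs without touching the disk.
buffer = io.StringIO()
reviews_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file, open it with open('list.csv', 'w', newline='', encoding='utf-8') and pass the file object together with the reviews list instead of the buffer and the sample.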

Saeed Esmaili