Web Scraping in Python

Question

I am trying to scrape a website for a university project. The website is https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813. I have a problem with my Python code: I want to get all the reviews from pages 1 to 5, but instead I only get empty lists ([]). Any help would be appreciated!

Here is the code:

import csv
from bs4 import BeautifulSoup
import urllib.request
import re
import pandas as pd
import requests
reviewlist = []
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813')

soup = BeautifulSoup(response,'html.parser')

reviews = soup.find_all('div',{'class':'reviewContent'})


for i in reviews:
    review = {

        'per_review_name' : i.find('span',{'itemprop':'name'}).text.strip(),
        'per_review' : i.find('p',{'class':'reviewText'}).text.strip(),
        'per_review_taglia' : i.find('p',{'class':'singleReviewSizeDescr'}).text.strip(),
        
    }
    reviewlist.append(review)
   
for page in range (1,5):
    prova = soup.find_all('div',{'data-page': '{page}'})
    print(prova)
    print(len(reviewlist))
        
df = pd.DataFrame(reviewlist)
df.to_csv('list.csv',index=False)
print('Fine.')

And here the output that I get:

[]
5
[]
5
[]
5
[]
5
Fine.

Note that `'{page}'` is literally the string `'{page}'`, _not_ an f-string — ForceBru, Jun 25 '21 at 14:09
Cannot test at the moment but try `prova = soup.find_all('div',{'data-page': f'{page}'})`. Note the `f` prefix (aka f-strings). — , Jun 25 '21 at 14:09
Thank you very much for the prompt answer, but it still doesn't work. — Clara Puglisi, Jun 25 '21 at 14:37
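
To illustrate the point made in the comments above, here is a minimal sketch of the difference between a plain string and an f-string. The variable names `plain`, `formatted`, `literal_filter` and `fstring_filter` are only illustrative and do not come from the question.

```python
page = 3

plain = '{page}'       # ordinary string: the braces are kept literally
formatted = f'{page}'  # f-string: the braces are replaced by the value of page

print(plain)
print(formatted)

# The attribute filters as they would be passed to soup.find_all('div', ...)
literal_filter = {'data-page': '{page}'}   # looks for data-page="{page}"
fstring_filter = {'data-page': f'{page}'}  # looks for data-page="3"
```

The f-string fixes the filter itself, but the search can still return an empty list if the matching elements are not present in the HTML that was downloaded.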

score 2 · Answer 1 · answered Jun 25 '21 at 15:12

As I understand it, the site uses JavaScript to load most of its content, so that data isn't in the initially loaded HTML and you can't scrape it from there. However, you can query the reviews backend for your product page directly. The link is:

https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date&page=1&rating=0&variant=0&size=0&bodyHeight=0&showOldReviews=true&xxl=false&variantFilters=

You can go through the pages by changing the page parameter in the URL of the GET request. The link returns an HTML document of the reviews page, and you can get each rating from the rating value meta tag.
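
The idea can be sketched roughly as follows. This is only an illustration under assumptions: the helper names `build_review_url` and `extract_ratings` are made up, the sample HTML is invented for the example, and the exact markup of the rating meta tag (here assumed to be `<meta itemprop="ratingValue" content="...">`) should be checked against the real response.

```python
from urllib.parse import urlencode

from bs4 import BeautifulSoup

BASE_URL = "https://www.bonprix.it/reviews/list/"


def build_review_url(page, style_id="31436999"):
    """Build the reviews URL for one page (hypothetical helper)."""
    params = {
        "styleId": style_id,
        "sortby": "date",
        "page": page,
        "rating": 0,
        "variant": 0,
        "size": 0,
        "bodyHeight": 0,
        "showOldReviews": "true",
        "xxl": "false",
        "variantFilters": "",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def extract_ratings(html):
    """Return the content of every rating meta tag (assumed markup)."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("meta", attrs={"itemprop": "ratingValue"})
    return [tag.get("content") for tag in tags]


# Invented sample, used only to show how the parsing step behaves.
SAMPLE_HTML = """
<div class="reviewContent"><meta itemprop="ratingValue" content="5"></div>
<div class="reviewContent"><meta itemprop="ratingValue" content="3"></div>
"""

print(build_review_url(2))
print(extract_ratings(SAMPLE_HTML))
```

In real use, the HTML passed to `extract_ratings` would be the text of the response fetched from each URL produced by `build_review_url`.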

knpfl


Thank you very much for your help! Now my code works perfectly well. What I'm curious about is how you got the link. My original link didn't contain the page number. Is there a way to do that in general? Thanks in advance. – Clara Puglisi Jun 26 '21 at 11:29
I think you can add the product ID as a parameter, or get the link from the Network tab of the browser's developer tools with the XHR filter enabled – knpfl Jun 27 '21 at 12:17
I tried it, but I don't know how to use the XHR filter (I checked Request URL, but nothing matches the URL you suggested). Any suggestion would be very helpful. – Clara Puglisi Jun 28 '21 at 15:22

score 0 · Accepted Answer · answered Jun 25 '21 at 15:37

The website only loads the first page of reviews in the initial request. If you inspect its network requests, you can see that it requests additional data when you change the review page. You can rewrite your code as follows to get the reviews from all pages:

import requests
from bs4 import BeautifulSoup

# Collect the review <div> elements from the first five review pages.
reviews_dom = []
for page in range(1, 6):
    url = f"https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date&page={page}&rating=0&variant=0&size=0&bodyHeight=0&showOldReviews=true&xxl=false&variantFilters="
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    reviews_dom += soup.find_all("div", attrs={"class": "reviewContent"})

# Extract the fields of interest from each collected review element.
reviews = []
for review_item in reviews_dom:
    review = {
        'per_review_name': review_item.find('span', attrs={'itemprop': 'name'}).text.strip(),
        'per_review': review_item.find('p', attrs={'class': 'reviewText'}).text.strip(),
        'per_review_taglia': review_item.find('p', attrs={'class': 'singleReviewSizeDescr'}).text.strip(),
    }
    reviews.append(review)

print(len(reviews))
print(reviews)

What happens in the code?

In the first loop, we request each page of reviews (the first 5 pages in the example above) and collect the review elements.

In the second loop, we go through the collected review elements and extract the data we need.
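
If some reviews lack one of the fields, `find` returns `None` and calling `.text` on it raises an `AttributeError`. The extraction step could be made more tolerant along these lines. This is only a sketch: the helper names `text_or_none` and `parse_reviews` are hypothetical, and the sample HTML is invented to imitate the class names used above.

```python
from bs4 import BeautifulSoup

# Invented sample: the second review has no size description.
SAMPLE_HTML = """
<div class="reviewContent">
  <span itemprop="name">Anna</span>
  <p class="reviewText"> Ottimo prodotto </p>
  <p class="singleReviewSizeDescr">Taglia giusta</p>
</div>
<div class="reviewContent">
  <span itemprop="name">Marta</span>
  <p class="reviewText">Buono</p>
</div>
"""


def text_or_none(parent, tag, attrs):
    """Return the stripped text of the first match, or None if absent."""
    node = parent.find(tag, attrs=attrs)
    return node.text.strip() if node is not None else None


def parse_reviews(html):
    """Turn the HTML of one reviews page into a list of dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for item in soup.find_all("div", attrs={"class": "reviewContent"}):
        reviews.append({
            "per_review_name": text_or_none(item, "span", {"itemprop": "name"}),
            "per_review": text_or_none(item, "p", {"class": "reviewText"}),
            "per_review_taglia": text_or_none(item, "p", {"class": "singleReviewSizeDescr"}),
        })
    return reviews


parsed = parse_reviews(SAMPLE_HTML)
print(parsed)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(...)` and saved with `to_csv`, as in the original question.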

You can find the link by inspecting the requests with your browser's developer tools. You may also accept the answer if it solves your problem. @ClaraPuglisi — Saeed Esmaili, Jun 27 '21 at 12:30
