
I'm new to Python and have managed to get this far using the HTML parser, but I'm stuck on how to get pagination working for the reviews at the bottom of the page.

The URL is in the PasteBin code; I am leaving it out of this thread for privacy reasons.

Any help is much appreciated.

# Reviews Scrape

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'EXAMPLE.COM'

# opening up the connection and grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# HTML Parsing
page_soup = soup(page_html, "html.parser")

# Grabs each review
reviews = page_soup.findAll("div",{"class":"jdgm-rev jdgm-divider-top"})

filename = "compreviews.csv"
f = open(filename, "w")

headers = "Score,Title,Content\n"

f.write(headers)
# Look up each field within the review container and strip whitespace
for container in reviews:
    # score = container.div.div.span["data-score"]
    score = container.findAll("span",{"data-score":True})
    user_score = score[0].text.strip()

    title_review = container.findAll("b",{"class":"jdgm-rev__title"})
    user_title = title_review[0].text.strip()

    content_review = container.findAll("div",{"class":"jdgm-rev__body"})
    user_content = content_review[0].text.strip()

    print("user_score:" + score[0]['data-score'])
    print("user_title:" + user_title)
    print("user_content:" + user_content)

    f.write(score[0]['data-score'] + "," +user_title + "," +user_content + "\n")

f.close()
Northwoods

1 Answer


The page makes an XHR GET request with a query string to fetch results. That query string has parameters for reviews per page and page number. You can make an initial request with what appears to be the maximum of 31 reviews per page, extract the HTML from the JSON returned, then grab the page count; a loop then runs over the remaining pages collecting results. Example construct below:

import requests
from bs4 import BeautifulSoup as bs

start_url = 'https://urlpart&page=1&per_page=31&product_id=someid'

with requests.Session() as s:
    r = s.get(start_url).json()
    soup = bs(r['html'], 'lxml')
    print([i.text for i in soup.select('.jdgm-rev__author')])
    print([i.text for i in soup.select('.jdgm-rev__title')])
    total_pages = int(soup.select_one('.jdgm-paginate__last-page')['data-page'])

    for page in range(2, total_pages + 1):
        r = s.get(f'https://urlpart&page={page}&per_page=31&product_id=someid').json()
        soup = bs(r['html'], 'lxml')
        print([i.text for i in soup.select('.jdgm-rev__author')])
        print([i.text for i in soup.select('.jdgm-rev__title')])  # etc.
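One detail worth guarding against (an assumption on my part, since the answer's code does not handle it): `select_one('.jdgm-paginate__last-page')` returns `None` when the paginator is absent, e.g. when all reviews fit on one page, and indexing it with `['data-page']` would then raise a `TypeError`. A small helper can fall back to a single page:

```python
from bs4 import BeautifulSoup as bs

def last_page(html):
    # The paginator element may be missing when every review fits on
    # one page; fall back to a single page in that case.
    el = bs(html, 'html.parser').select_one('.jdgm-paginate__last-page')
    return int(el['data-page']) if el else 1
```

In the loop above you would call `last_page(r['html'])` instead of indexing the element directly.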

Example writing the results to csv via a pandas dataframe:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

start_url = 'https://urlpart&page=1&per_page=31&product_id=someid'

authors = []
titles = []

with requests.Session() as s:
    r = s.get(start_url).json()
    soup = bs(r['html'], 'lxml')
    authors.extend([i.text for i in soup.select('.jdgm-rev__author')])
    titles.extend([i.text for i in soup.select('.jdgm-rev__title')])
    total_pages = int(soup.select_one('.jdgm-paginate__last-page')['data-page'])

    for page in range(2, total_pages + 1):
        r = s.get(f'https://urlpart&page={page}&per_page=31&product_id=someid').json()
        soup = bs(r['html'], 'lxml')
        authors.extend([i.text for i in soup.select('.jdgm-rev__author')])
        titles.extend([i.text for i in soup.select('.jdgm-rev__title')])  # etc.

headers = ['Author', 'Title']
df = pd.DataFrame(zip(authors, titles), columns=headers)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8', index=False)
QHarr
  • You can integrate your code. I would stick with the requests library and a Session, and you also want the loop structure. You can reuse your extraction code for author etc., since you will have the soup object. Consider appending to lists during the loop, constructing a dataframe at the end, and writing that to csv. You could use pandas for this (my preference) or stick with the csv module and write out rows (this feels a little slow) – QHarr Dec 27 '19 at 01:25
  • Basically replace my print statements with code you want for extracting required info and constructing your final output. – QHarr Dec 27 '19 at 01:25
  • OK, thank you. I'm new to Python, so I will take your code, do some more research, and see if I can get it to work. Thanks again – Northwoods Dec 27 '19 at 01:33
  • It's ok. It is likely someone will answer and make it closer to your code. The above just shows you another way. I do note that total reviews returned is actually a little less than indicated on page but I believe it returns the correct number. – QHarr Dec 27 '19 at 01:34
  • Can you tell me how you found that URL that has the xhr GET? – Northwoods Jan 06 '20 at 18:54
  • I used the browser network tab e.g. [1](https://stackoverflow.com/a/56279841/6241235) and [2](https://stackoverflow.com/a/56924071/6241235) – QHarr Jan 06 '20 at 19:20