
I'm learning how to build a scraper for another website, Reverb.com, after getting my scraper for a different site working properly. Reverb, however, has been more challenging to extract information from, and the approach from my old scraper isn't working the same way. From my research, using requests_html instead of requests seemed to be the common choice for JavaScript-rendered pages like Reverb.com.

I'm essentially trying to scrape out text versions of the headline and price information, and either paginate through the different pages or loop through a list of URLs to get all the content. I'm almost there but hitting roadblocks. Below are two versions of code I'm fiddling with.

The first version below prints what looks like only 3 of many pages of content, but it prints all the instrument names and prices, markup included. In the CSV, however, all of those items end up crammed onto just 3 rows rather than one item/price pair per row.

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent


session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")

#content scrape
b = soup.findAll("h4", class_="grid-card__title") #title
for i in b:
    print(i)


p = soup.findAll("div", class_="grid-card__price") #price
for i in p:
    print(i)
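For reference, the one-pair-per-row output I'm after would look something like the sketch below: pair the two result lists with zip and write one row per pair. The HTML snippet here is a small stand-in for the rendered Reverb page (the class names are taken from my code above, but the markup structure is assumed):

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in markup mimicking Reverb's grid cards; the real page is
# JavaScript-rendered, so this is only an offline approximation.
html = """
<h4 class="grid-card__title">Fender Jazz Bass</h4>
<div class="grid-card__price">$1,200</div>
<h4 class="grid-card__title">Hofner Violin Bass</h4>
<div class="grid-card__price">$2,500</div>
"""

soup = BeautifulSoup(html, "html.parser")
titles = soup.find_all("h4", class_="grid-card__title")
prices = soup.find_all("div", class_="grid-card__price")

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["bass_name", "bass_price"])
# zip pairs each title tag with its matching price tag, so each
# iteration writes exactly one item/price pair per CSV row.
for title, price in zip(titles, prices):
    writer.writerow([title.text.strip(), price.text.strip()])

print(out.getvalue())
```

Calling .text.strip() on each tag is what removes the surrounding markup before the row is written.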

Conversely, this version prints only 3 lines to the CSV, but the name and price are stripped of all the markup. That only happens when I change findAll to just find, though. I read that for html in r.html was a way to loop through pages without having to make a list of URLs.

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent


#make csv file
csv_file = open("rvscrape.csv", "w", newline='') #added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name","bass_price"])

session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")

for html in r.html:
    #content scrape
    bass_name = []
    b = soup.find("h4", class_="grid-card__title").text.strip() #title
    #for i in b:
    #    bass_name.append(i)
    #    for i in bass_name:
    #        print(i)

    price = []
    p = soup.find("div", class_="grid-card__price").text.strip() #price
    #for i in p:
    #    print(i)

    csv_writer.writerow([b, p])

1 Answer


In order to extract all the pages of search results, you need to extract the link to the next page and keep going until no next page is available. We can do this with a while loop that checks for the existence of the "next" anchor tag. The following script performs that loop and writes the results to the CSV. It also prints the URL of each page, so we have an estimate of which page the program is on.

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent


# make csv file
# added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_file = open("rvscrape.csv", "w", newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name", "bass_price"])

session = HTMLSession()
r = session.get(
    "https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)

stop = False
next_url = ""
while not stop:
    print(next_url)
    soup = BeautifulSoup(r.html.raw_html, "html.parser")

    titles = soup.findAll("h4", class_="grid-card__title")  # titles
    prices = soup.findAll("div", class_="grid-card__price")  # prices

    # zip pairs each title with its price and avoids an IndexError
    # if the two lists ever differ in length
    for title, price in zip(titles, prices):
        csv_writer.writerow([title.text.strip(), price.text.strip()])

    next_link = soup.find("li", class_="pagination__page--next")
    if not next_link:
        stop = True
    else:
        next_url = next_link.find("a").get("href")
        r = session.get("https://reverb.com/marketplace" + next_url)
        r.html.render(sleep=5)

csv_file.close()

Output issues like this are common when scraping JavaScript-heavy websites. They can also be solved with dynamic (browser-based) scrapers, which render the page before extracting data, as requests_html's render() does here.
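As a general pattern, the follow-the-next-link loop above can be isolated into a small generator that yields one parsed page at a time. This is a sketch, not Reverb-specific code: the fetch function and the two fake pages below are stand-ins so it runs offline, while the class names match the ones used above.

```python
from bs4 import BeautifulSoup

BASE = "https://reverb.com/marketplace"

def paginate(fetch, first_url):
    """Yield a BeautifulSoup per page, following the next-page link
    until no li.pagination__page--next element is present."""
    url = first_url
    while url:
        soup = BeautifulSoup(fetch(url), "html.parser")
        yield soup
        nxt = soup.find("li", class_="pagination__page--next")
        a = nxt.find("a") if nxt else None
        href = a.get("href") if a else None
        url = BASE + href if href else None

# Stub "fetcher": a dict of fake pages instead of real HTTP requests.
PAGES = {
    "page1": ('<h4 class="grid-card__title">A</h4>'
              '<li class="pagination__page--next">'
              '<a href="?page=2"></a></li>'),
    BASE + "?page=2": '<h4 class="grid-card__title">B</h4>',
}

titles = [t.text for page in paginate(PAGES.get, "page1")
          for t in page.find_all("h4", class_="grid-card__title")]
print(titles)  # → ['A', 'B']
```

In the real script, fetch would be a function that does session.get(url), calls r.html.render(), and returns r.html.raw_html.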

Cody Gray - on strike
Gidoneli