
I need to scrape car rankings from a number of websites.

For example:

https://www.kbb.com/articles/best-cars/10-best-used-cars-under-10000/

  1. 2011 Toyota Camry
  2. 2013 Honda Civic ...

https://www.autoguide.com/auto-news/2019/10/top-10-best-cars-for-snow.html

Dodge Charger AWD Subaru Outback Nissan Altima AWD ...

I'm having trouble detecting rankings because every website is structured a bit differently. My goal is basically a script that automatically detects the ranking and retrieves the data I need (brand + car model for each entry) on any given car website with reasonably high accuracy.

The data I want to collect (brand + car model in the ranking) is sometimes in H2, H3, or H4 headings, sometimes in links... Sometimes it's written as "1. Brand1 Model1, 2. Brand2 Model2...", sometimes just "Brand1 Model1, Brand2 Model2..." It depends on the site.

I'm doing this in Python with BeautifulSoup.

What would be a good approach?

Edit:

To be clear, I'm struggling to analyse the data, not to scrape it (see comments below). Here is how I handled the first example above:

list_url, list_sub_heading = [], []

for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")

    for sub_heading in soup.find_all('h2'):
        # filter applied to keep only headings containing "1. " (and skip "11.")
        if "1. " in sub_heading.text and "11." not in sub_heading.text:
            list_url.append(url)
            list_sub_heading.append(sub_heading.text)
            print(list_sub_heading)

RESULT: ['1. 2011 Toyota Camry']
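A slightly more general version of this check, just a sketch based on my examples above, would scan all of h2/h3/h4 at once and match any entry that starts with a number and a dot (the function name `extract_ranked_headings` and the pattern are my own assumptions, not guaranteed to fit every site):

```python
import re

from bs4 import BeautifulSoup

# pattern for entries like "1. 2011 Toyota Camry" or "10. 2012 Subaru Legacy"
RANK_RE = re.compile(r"^\s*\d{1,2}\.\s+\S")


def extract_ranked_headings(html):
    """Return heading texts that look like ranking entries, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        tag.get_text(strip=True)
        for tag in soup.find_all(["h2", "h3", "h4"])
        if RANK_RE.match(tag.get_text(strip=True))
    ]
```

This still misses rankings that are not numbered or not in headings (like the second example), so it is only a partial step toward a generic detector.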

Zumplo
  • Maybe you can also use requests in your case. See this for details: https://stackoverflow.com/questions/60353109/how-to-simulate-a-button-click-in-a-request/60356159#60356159 – Demian Wolf Apr 24 '20 at 13:55
  • If the data is more dynamic and JavaScript-controlled, then Selenium would be a better bet. – Sureshmani Kalirajan Apr 24 '20 at 13:58
  • Yes, I'm also using requests. To be clear, I'm able to scrape the data most of the time. But then I have trouble detecting rankings out of all the data I collect. – Zumplo Apr 24 '20 at 13:59
  • @user1564140 I have taken a look at the first link. There is no JSON request that gets the names of the cars, so the approach from the link cannot be used there. But maybe it can be used for the second link. – Demian Wolf Apr 24 '20 at 14:01
  • @Sureshmani that sounds interesting. Could you expand a bit? – Zumplo Apr 24 '20 at 14:02
  • Regarding Selenium, here is an article: https://www.guru99.com/introduction-to-selenium.html – Demian Wolf Apr 24 '20 at 14:12
  • This script will detect the cars' names as in the "For example" section (I used ; because SO doesn't support multiline code in comments, but you shouldn't do that in your code otherwise!): ```from bs4 import BeautifulSoup; import urllib.request; soup = BeautifulSoup(urllib.request.urlopen("https://www.kbb.com/articles/best-cars/10-best-used-cars-under-10000/").read()); cars = [car_name.text for car_name in soup.find_all("div", class_="tdb-block-inner td-fix-index")[4].find_all("h3")]``` – Demian Wolf Apr 24 '20 at 14:15
  • @DemianWolf thanks, that's pretty cool. That said, my question is more about the right approach to develop a solution that works "all the time". Your solution works for example 1, but not for example 2. – Zumplo Apr 24 '20 at 14:28
  • @user1564140 Thank you :) Unfortunately, there can't be a solution for "all the time" and all websites, because different websites may have completely different HTML. But I can give you one for these two; their HTML is pretty similar (both use h3 elements inside div elements, but the div class names are different). Is that what you need? – Demian Wolf Apr 24 '20 at 14:46
  • @user1564140 check my answer below – αԋɱҽԃ αмєяιcαη Apr 24 '20 at 15:50

1 Answer

import requests
from bs4 import BeautifulSoup


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # each car image (<img class="alignnone">) is preceded by an <h3> with its name
    goal = [item.find_previous("h3").text for item in soup.findAll(
        "img", class_="alignnone")]
    # drop duplicates while preserving order
    mylist = list(dict.fromkeys(goal))
    print(mylist)


main("https://www.kbb.com/articles/best-cars/10-best-used-cars-under-10000/")

Output:

['1. 2011 Toyota Camry', '2. 2013 Honda Civic', '3. 2009 Toyota Avalon', '4. 2011 Honda Accord', '5. 2010 Toyota Prius', '6. 2012 Mazda Mazda3', '7. 2011 Toyota Corolla', '8. 2010 Subaru Outback', '9. 2013 Kia Soul', '10. 2012 Subaru Legacy']

re version:

import requests
import re


def main(url):
    r = requests.get(url)
    # capture a rank like "1." after a tag boundary, then the text of the next tag
    match = [f'{item.group(1)} {item.group(2)}'
             for item in re.finditer(r'>(\d+\.).+?>(.+?)<', r.text)]
    print(match)


main("https://www.kbb.com/articles/best-cars/10-best-used-cars-under-10000/")

Output:

['1. 2011 Toyota Camry', '2. 2013 Honda Civic', '3. 2009 Toyota Avalon', '4. 2011 Honda Accord', '5. 2010 Toyota Prius', '6. 2012 Mazda Mazda3', '7. 2011 Toyota Corolla', '8. 2010 Subaru Outback', '9. 2013 Kia Soul', '10. 2012 Subaru Legacy']
  • Thanks, that's interesting. But this is very specific to one website. My question was more about the right general approach, as I'm dealing with many different websites where I want to retrieve the same information (car brands and models in rankings). – Zumplo Apr 26 '20 at 10:02
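For the generality asked about in the comment above, one hedged sketch (my own heuristic, not taken from the answer) is to stop hard-coding a tag and class and instead group candidate elements by (tag name, class), keep only texts matching a rank pattern, and return the largest group. The tag list, the `RANK_RE` pattern, and the function name are all assumptions and will not cover every site:

```python
import re
from collections import defaultdict

from bs4 import BeautifulSoup

# a "rank" looks like "1." or "10)" followed by a name (assumption)
RANK_RE = re.compile(r"^\s*\d{1,2}[.)]\s+\S")


def detect_ranking(html):
    """Group rank-looking texts by (tag name, class) and return the
    largest group, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    groups = defaultdict(list)
    for tag in soup.find_all(["h2", "h3", "h4", "li", "a"]):
        text = tag.get_text(" ", strip=True)
        if RANK_RE.match(text):
            key = (tag.name, tuple(tag.get("class") or []))
            groups[key].append(text)
    if not groups:
        return []
    # assume the real ranking is the group with the most matching entries
    return max(groups.values(), key=len)
```

This still fails on unnumbered lists like the second example, so in practice it would need more scoring signals (consecutive numbering, year-plus-brand patterns, sibling structure) to reach "reasonably high accuracy".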