I need to scrape car rankings from a number of websites.
For example:
https://www.kbb.com/articles/best-cars/10-best-used-cars-under-10000/
- 2011 Toyota Camry
- 2013 Honda Civic ...
https://www.autoguide.com/auto-news/2019/10/top-10-best-cars-for-snow.html
- Dodge Charger AWD
- Subaru Outback
- Nissan Altima AWD ...
I'm having trouble detecting rankings on websites as they are all a bit different. My goal is basically to have a script that would automatically detect the ranking and retrieve the data I need (Brand + car model in the ranking) on any given car website with a reasonably high accuracy.
The data I want to collect (brand + car model in the ranking) is sometimes in an H2, H3 or H4 heading, sometimes in links. Sometimes it's written as "1. Brand1 Model1, 2. Brand2 Model2...", sometimes as "Brand1 Model1, Brand2 Model2...". It depends on the site.
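To illustrate the two inline formats mentioned above, here is a minimal sketch that normalizes both into a plain list of "Brand Model" strings. The helper name `split_ranking_text` is hypothetical, and it assumes items are separated by commas, which will not hold on every site.

```python
import re

# Hypothetical helper (not from any library). Assumes comma-separated items.
def split_ranking_text(text):
    """Split 'A, B' or '1. A, 2. B' into ['A', 'B']."""
    items = []
    for part in text.split(","):
        # Drop an optional leading rank such as "1. " or "2) ".
        # A leading year like "2011 Toyota" is kept, because the digits
        # must be followed by "." or ")" to count as a rank.
        cleaned = re.sub(r"^\s*\d+\s*[.)]\s*", "", part).strip()
        if cleaned:
            items.append(cleaned)
    return items

print(split_ranking_text("1. Brand1 Model1, 2. Brand2 Model2"))
print(split_ranking_text("Brand1 Model1, Brand2 Model2"))
```

Both calls should yield the same two items, so the downstream code only has to deal with one shape.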
I'm doing this in Python with BeautifulSoup.
What would be a good approach?
Edit:
To be clear, I'm struggling to analyse the data, not to scrape it (see comments below). For reference, here is how I handled the 1st example above:
import requests
from bs4 import BeautifulSoup

list_sub_heading = []
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    for sub_heading in soup.find_all('h2'):
        text = sub_heading.get_text().strip()
        # Keep only headings that start with "1. "
        if text.startswith("1. "):
            list_sub_heading.append(text)
print(list_sub_heading)
RESULT: ['1. 2011 Toyota Camry']
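A rough sketch of the kind of analysis step I'm after, generalizing the "1. " filter: take the text of candidate elements, keep those with a numeric rank prefix, and accept them as a ranking only if the ranks form a consecutive run. The names `extract_ranked_items`, `looks_like_ranking` and the `min_items` threshold are hypothetical, the third sample entry is a placeholder, and the heuristic is only an assumption about how ranking pages tend to look.

```python
import re

# A rank prefix such as "1. " or "2) ", followed by the item label.
RANK_PREFIX = re.compile(r"^\s*(\d+)\s*[.)]\s+(.*\S)")

def extract_ranked_items(texts):
    """Return [(rank, label), ...] for texts shaped like '1. 2011 Toyota Camry'."""
    ranked = []
    for text in texts:
        match = RANK_PREFIX.match(text)
        if match:
            ranked.append((int(match.group(1)), match.group(2)))
    return ranked

def looks_like_ranking(ranked, min_items=3):
    """Heuristic: ranks run 1..n ascending, or n..1 for countdown-style lists."""
    ranks = [rank for rank, _ in ranked]
    n = len(ranks)
    if n < min_items:
        return False
    return ranks == list(range(1, n + 1)) or ranks == list(range(n, 0, -1))

# Sample heading texts; "Brand3 Model3" is a placeholder, not real page data.
headings = ["1. 2011 Toyota Camry", "2. 2013 Honda Civic", "About us", "3. Brand3 Model3"]
ranked = extract_ranked_items(headings)
if looks_like_ranking(ranked):
    print([label for _, label in ranked])
```

The input texts could come from `soup.find_all(['h2', 'h3', 'h4'])` so that the heading level no longer matters; lists without numbers would still need a separate heuristic.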