3

I am trying to use python to scrape the names of restaurants from a website. I'm having a hard time figuring out which exact div class to target and then how to write the code to do the scraping. I have successfully written the code for other webpages but can't figure it out for this one.

I am targeting this webpage: https://www.broadsheet.com.au/melbourne/fitzroy

Here is what I have tried:

soup_rest_list = BeautifulSoup(html_rest, 'html.parser')
type(soup_rest_list)

rest_container = soup_rest_list.find_all(class_="venue-teaser-list format-horizontal VenueTeaserListWrapper-sc-13dcca9-1 fIcGQi", "h2", class_="venue-title")

I'm not getting much love though. Right now when I execute my code I just get a []

Any help greatly appreciated.

deadant88
  • Could you add a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example)? Like a part of the html doc, and the desired output... – MrNobody33 Jul 05 '20 at 04:06

2 Answers

0

Using find_all, you just have to look for h2 tags that have the class venue-title, and then read the text attribute of each match.

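from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

url = 'https://www.broadsheet.com.au/melbourne/fitzroy'
driver.get(url)

# page_source is the HTML of the page as the browser currently has it
page = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # close the browser once the HTML has been captured

elements = page.find_all("h2", class_="venue-title")
names = [i.text for i in elements]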

>> names
 
['Poodle Bar & Bistro',
 'Gogyo',
 'Rice Queen',
 'Vegie Bar',
 'Smith & Daughters',
 'Belles Hot Chicken Fitzroy',
 'Grub Fitzroy',
 'Archie’s All Day',
 'Sonido',
 'Gabriel',
 'Mile End Bagels',
 'Napier Quarter',
 'Bonny',
 'Near & Far',
 'The Everleigh',
 "Milney's",
 'Mono-XO',
 'The Rum Diary Bar',
 'Smith & Deli',
 'Meatsmith Fitzroy',
 'American Vintage',
 'Hunter Gatherer',
 'Plane',
 'Aesop']
Renato Aranha
  • Thank you for your response, could you explain how driver.page_source works ? – deadant88 Jul 05 '20 at 05:14
  • Sure. driver is the webdriver from the Selenium package, and driver.page_source gathers the whole HTML of the page. I've added it to the code. For the Selenium documentation, you can check https://selenium-python.readthedocs.io/api.html. Hope it helps. – Renato Aranha Jul 05 '20 at 05:24
  • Thank you so much for your response. I’ve heard of Selenium but so far have only used Beautiful Soup, I’m quite new to this. Is there an optimal way of using the two together? – deadant88 Jul 05 '20 at 06:18
0

First off, if you had actually run the code exactly as posted, i.e.

rest_container = soup_rest_list.find_all(class_="venue-teaser-list format-horizontal VenueTeaserListWrapper-sc-13dcca9-1 fIcGQi", "h2", class_="venue-title")

Python would have reported a syntax error rather than assigning [] to rest_container, as 1) "h2" is a positional argument that came after class_, and then 2) class_ was specified a second time as a keyword argument.
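
As a rough illustration of that point, the sketch below compiles the call as a string without running it, so no page or soup object is needed. The long class string is shortened here for readability, and `syntax_error_message` is a hypothetical helper, not part of BeautifulSoup.

```python
# The call from the question, kept as a string so it can be compiled
# without being executed. The class string is shortened (assumption:
# its exact value does not affect the syntax error).
broken_call = (
    'rest_container = soup_rest_list.find_all('
    'class_="venue-teaser-list", "h2", class_="venue-title")'
)

# A version with the tag name first and a single class_ keyword.
fixed_call = 'rest_container = soup_rest_list.find_all("h2", class_="venue-title")'


def syntax_error_message(source):
    """Return the SyntaxError message for source, or None if it compiles."""
    try:
        compile(source, "<question>", "exec")
    except SyntaxError as exc:
        return exc.msg
    return None


broken_message = syntax_error_message(broken_call)
fixed_message = syntax_error_message(fixed_call)

print("broken:", broken_message)  # some SyntaxError message
print("fixed:", fixed_message)    # None, the call is syntactically valid
```

The exact wording of the error depends on the Python version, which is why the sketch only checks that an error is raised.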

What you are probably looking for is the CSS selector functionality, which will let you select the elements within a set of elements like you wanted by specifying the equivalent CSS selector rule:

>>> soup_rest_list.select("div.venue-teaser-list.format-horizontal.VenueTeaserListWrapper-sc-13dcca9-1.fIcGQi h2.venue-title")
[<h2 class="venue-title">...]
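
To see what the selector does without hitting the live site, here is a minimal sketch on made-up markup. The HTML and venue names are hypothetical and only loosely modelled on the class names in the question; the real page will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
sample_html = """
<div class="venue-teaser-list format-horizontal">
  <h2 class="venue-title">Venue A</h2>
  <h2 class="venue-title">Venue B</h2>
</div>
<div class="other-list">
  <h2 class="venue-title">Venue C</h2>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "div.x.y" matches a div that carries both classes; the space before
# "h2" means "any h2.venue-title nested somewhere inside that div".
scoped = [
    h2.get_text()
    for h2 in soup.select("div.venue-teaser-list.format-horizontal h2.venue-title")
]

# find_all with a tag name and one class matches every such h2 on the page.
everything = [h2.get_text() for h2 in soup.find_all("h2", class_="venue-title")]

print(scoped)      # ['Venue A', 'Venue B']
print(everything)  # ['Venue A', 'Venue B', 'Venue C']
```

Generated class names such as `fIcGQi` tend to change between site builds, so a selector that leans on the stable-looking classes only is probably less brittle.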
metatoaster
  • Thanks so much, yes I stupidly added that additional “class” into the code posted here, not sure why I thought it would be helpful. I feel a bit over my head. I couldn’t find the type of code you have written in BS documentation, do you mind explaining how it works a bit? If it’s too big of a task, don’t worry. I am still learning and I appreciate you taking the time to respond in the first place. – deadant88 Jul 05 '20 at 05:17
  • Did you open the link where I specified [CSS selector](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors), which takes you directly to the relevant section in the BS documentation? It has a whole slew of examples, and from there a link to [Soup Sieve](https://facelessuser.github.io/soupsieve/), which this feature was derived from. – metatoaster Jul 05 '20 at 06:21
  • I did, yes. Thanks for the pointers, it's beginning to make more sense. I do appreciate it. Can I ask, why do you use "." in your code between eg: div.venue ? – deadant88 Jul 05 '20 at 09:33
  • In your question, you wanted to select multiple classes at the same time - my answer selected those same classes through the [CSS class selector](https://developer.mozilla.org/en-US/docs/Web/CSS/Class_selectors) following the syntax for [selecting the element with multiple classes](https://stackoverflow.com/questions/2554839/select-element-based-on-multiple-classes). Though in retrospect I didn't need to start with `div` as your question didn't specify the `div` tag. – metatoaster Jul 05 '20 at 10:08
  • Your advice is working in that I can now successfully grab all of the h2 headers on the page, however I am trying to limit the headers the script grabs to just "restaurants". I have tried `div.layout-block guide-venue-section h2.venue-title` based off the link's css `
    ` because that seems to be a more precise reference to just the restaurants. However it is not returning the h2 titles like the code you've provided does. I'm not sure how to incorporate the css `id="restaurants"` into the bs select script.
    – deadant88 Jul 05 '20 at 12:18
  • 1
    If you look at the side bar for the CSS class selectors link you may have noticed [ID selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/ID_selectors) - from that you could have tried `soup_rest_list.select("#restaurants h2.venue-title")` and see that your results are found. Please take more effort in reading the documentation carefully and going through relevant links to find the answers yourself, as part of being a software developer involves reading a lot of documentation - it will save you time waiting for people writing a custom answer just for you (if they are nice enough to). – metatoaster Jul 05 '20 at 12:43
  • 1
    You may also wish to consult the [CSS documentation](https://developer.mozilla.org/en-US/docs/Web/CSS), which is also linked from the breadcrumb on top of those documentation pages. On there you may find the various help pages and the tutorials useful, not just for CSS but for other related web technologies. – metatoaster Jul 05 '20 at 13:00
  • Thank you, I appreciate your help and advice. You have helped solve my issue. – deadant88 Jul 06 '20 at 02:19
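
For readers following the comment thread, the sketch below shows why the attempted `div.layout-block guide-venue-section h2.venue-title` selector finds nothing while the ID selector works. The markup is hypothetical: it assumes each section is a div carrying both classes plus an id such as `restaurants`, as the comments suggest, and `titles` is a helper made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure is an assumption.
sample_html = """
<div id="restaurants" class="layout-block guide-venue-section">
  <h2 class="venue-title">Restaurant A</h2>
</div>
<div id="bars" class="layout-block guide-venue-section">
  <h2 class="venue-title">Bar B</h2>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def titles(selector):
    """Return the text of every element matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]


# A space is the descendant combinator, so "guide-venue-section" is read
# as a tag name nested inside div.layout-block, and nothing matches.
attempt = titles("div.layout-block guide-venue-section h2.venue-title")

# Chaining the classes with dots matches a div that has both of them.
both_sections = titles("div.layout-block.guide-venue-section h2.venue-title")

# "#restaurants" limits the match to the element with that id.
restaurants_only = titles("#restaurants h2.venue-title")

print(attempt)           # []
print(both_sections)     # ['Restaurant A', 'Bar B']
print(restaurants_only)  # ['Restaurant A']
```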