I am trying to develop a sustainable web scraping script to acquire a list of all products from a website. The product category links are in dropdown (or expandable) elements on the webpage. I'm using PyQt5 to emulate a client before extracting the HTML and parsing it with Beautiful Soup.
For instance, if you were visiting the site in your browser, you would have to click a button near the top-left corner of the page to open a category list that pops out from the left side of the screen (I will refer to this as the "side-bar"). Each of those categories, when clicked, expands into a list of more specific categories, each with a link that I am trying to acquire with my code (I will refer to these as "sub-categories").
The initial category list elements show up in my Beautiful Soup output even if the side-bar is hidden, but the sub-category elements remain hidden unless the sub-category header is expanded (thus, they don't show up in my soup). I have confirmed this by inspecting elements in a Chrome browser manually. Here is a snippet of the webpage HTML with my own comments to help explain:
    <div aria-label="Fruits & Vegetables" data-automation-id="taxonomy-toggle-Fruits & Vegetables">
        <!-- Initial category that contains sub-categories -->
        <button aria-disabled="false" aria-expanded="false" class="NavSection__sectionBtn___1_cAs" data-automation-id="nav-section-toggle" tabindex="-1">
        </button>
        <!-- Contains the links I need, but doesn't populate HTML text unless sub-category element is expanded -->
        <div>
        </div>
    </div>
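To make the collapsed state concrete, here is a minimal sketch of how the collapsed sections could be detected in the parsed soup. It assumes the markup matches the snippet above; `SAMPLE_HTML` and `collapsed_categories` are hypothetical names used only for illustration, and the built-in `html.parser` is used instead of `lxml` just to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the collapsed markup shown above.
SAMPLE_HTML = """
<div aria-label="Fruits &amp; Vegetables" data-automation-id="taxonomy-toggle-Fruits &amp; Vegetables">
  <button aria-disabled="false" aria-expanded="false"
          class="NavSection__sectionBtn___1_cAs"
          data-automation-id="nav-section-toggle" tabindex="-1"></button>
  <div></div>
</div>
"""


def collapsed_categories(html):
    """Return the aria-label of every category whose toggle button is collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    labels = []
    toggles = soup.find_all("button", attrs={"data-automation-id": "nav-section-toggle"})
    for button in toggles:
        if button.get("aria-expanded") == "false":
            # The category name lives on the enclosing <div aria-label="...">
            parent = button.find_parent("div", attrs={"aria-label": True})
            if parent is not None:
                labels.append(parent["aria-label"])
    return labels


print(collapsed_categories(SAMPLE_HTML))
```

Any category returned here is one whose sub-category links are not yet present in the extracted HTML.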
Here's how it looks if the sub-category element has been expanded:
    <div aria-label="Fruits & Vegetables" data-automation-id="taxonomy-toggle-Fruits & Vegetables">
        <!-- Initial category that contains sub-categories -->
        <button aria-disabled="true" aria-expanded="true" class="NavSection__sectionBtn___1_cAs" data-automation-id="nav-section-toggle" tabindex="-1">
        </button>
        <div>
            <ul class>
                <!-- Can open each li element up to acquire the href link -->
                <li class="NavSection__sectionLink__rbr40"> </li>
                <li class="NavSection__sectionLink__rbr40"> </li>
                <li class="NavSection__sectionLink__rbr40"> </li>
            </ul>
        </div>
    </div>
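Once the expanded markup is available, pulling out the links should be straightforward. The sketch below assumes each `<li>` wraps an `<a href="...">` tag; the snippet above elides the contents of the `<li>` elements, so the anchors and their `href` values here are placeholders, and `EXPANDED_HTML` and `sub_category_links` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical expanded markup: the <a> tags and href values are placeholders,
# because the original snippet does not show what is inside each <li>.
EXPANDED_HTML = """
<div aria-label="Fruits &amp; Vegetables">
  <button aria-expanded="true" data-automation-id="nav-section-toggle"></button>
  <div>
    <ul>
      <li class="NavSection__sectionLink__rbr40"><a href="/example/sub-category-1">One</a></li>
      <li class="NavSection__sectionLink__rbr40"><a href="/example/sub-category-2">Two</a></li>
    </ul>
  </div>
</div>
"""


def sub_category_links(html):
    """Collect the href of every anchor nested inside a sub-category <li>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all("li", class_="NavSection__sectionLink__rbr40"):
        for anchor in item.find_all("a", href=True):
            links.append(anchor["href"])
    return links


print(sub_category_links(EXPANDED_HTML))
```

The hashed class name (`NavSection__sectionLink__rbr40`) looks auto-generated, so it may change between site builds; that is an assumption worth checking before relying on it.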
And here's my code:
    import sys

    import bs4 as bs
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage


    # Act as a client via Qt5 to acquire JavaScript-rendered elements from the webpage
    class Page(QWebEnginePage):
        def __init__(self, url):
            # The QApplication must exist before the page object is initialised
            self.app = QApplication(sys.argv)
            QWebEnginePage.__init__(self)
            self.html = ''
            self.loadFinished.connect(self._on_load_finished)
            self.load(QUrl(url))
            self.app.exec_()  # blocks until callable() quits the event loop

        def _on_load_finished(self):
            # toHtml() is asynchronous and returns None; the HTML arrives in the callback
            self.toHtml(self.callable)
            print("Load Finished")

        def callable(self, html_str):
            self.html = html_str
            self.app.quit()


    page = Page("https://grocery.walmart.com")
    soup = bs.BeautifulSoup(page.html, 'lxml')
    print(soup.prettify())
I know that if the aria-expanded and aria-disabled attributes of the <button> element are changed from "false" to "true", the sub-category <li> elements will appear in the HTML. I confirmed this through manual inspection in the Chrome browser.
My question is whether it is possible to acquire the href from the <li> elements. My assumption is that I'd have to edit the HTML to change the aria attributes from "false" to "true" after an initial parse and then re-parse the HTML with those changes. If not, is there any other method to get these elements from the webpage other than Selenium? I'm trying to use a leaner approach (no opening of browser windows, etc.).
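The edit-and-re-parse assumption can be tested offline with a small sketch like the one below. It uses a hypothetical sample (`COLLAPSED_HTML`) mirroring the collapsed markup, flips the aria attributes in the parsed tree, re-parses the result, and counts the `<li>` tags before and after. Beautiful Soup only parses static markup and does not run the page's JavaScript, so this only shows what happens on the parsing side; `flip_aria_and_reparse` is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the collapsed markup from the question.
COLLAPSED_HTML = """
<div aria-label="Fruits &amp; Vegetables">
  <button aria-disabled="false" aria-expanded="false"
          data-automation-id="nav-section-toggle" tabindex="-1"></button>
  <div></div>
</div>
"""


def flip_aria_and_reparse(html):
    """Set the aria attributes to "true", re-parse, and count <li> tags before and after."""
    soup = BeautifulSoup(html, "html.parser")
    li_before = len(soup.find_all("li"))
    for button in soup.find_all("button", attrs={"aria-expanded": "false"}):
        button["aria-expanded"] = "true"
        button["aria-disabled"] = "true"
    # Serialise the edited tree and parse it again from scratch
    reparsed = BeautifulSoup(str(soup), "html.parser")
    li_after = len(reparsed.find_all("li"))
    return li_before, li_after, reparsed


li_before, li_after, reparsed = flip_aria_and_reparse(COLLAPSED_HTML)
print(li_before, li_after)
print(reparsed.button["aria-expanded"])
```

If the `<li>` count stays at zero after the re-parse, the attribute change on its own is not what populates the list, and the expansion would have to be triggered inside the rendered page before the HTML is extracted.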
I can provide the actual website URL and a screenshot of the webpage to help clarify; I'm not sure if that's considered good practice or allowed on Stack Overflow (I'm new here!).
For more background information on the method I'm trying to use, see the following: