
I am trying to develop a sustainable web scraping script to acquire a list of all products from a website. The product category links sit in dropdown (expandable) elements on the webpage. I'm using PyQt5 to emulate a client, then extracting the HTML and parsing it with Beautiful Soup.

For instance, if you were visiting the site in your browser, you would have to click a button near the top-left corner of the page to open a category list that slides out from the left side of the screen (I will refer to this as the "side-bar"). Within each of those categories, when clicked, there is a list of more specific categories, each with a link that I am trying to acquire with my code (I will refer to these as "sub-categories").

The initial category list elements show up in my Beautiful Soup output even when the side-bar is hidden, but the sub-category elements remain hidden unless the sub-category header is expanded (so they don't show up in my soup). I have confirmed this by inspecting the elements manually in a Chrome browser. Here is a snippet of the webpage HTML with my own comments to help explain:

<div aria-label="Fruits &amp; Vegetables" data-automation-id="taxonomy-toggle-Fruits &amp; Vegetables">
  <button aria-disabled="false" aria-expanded="false" class="NavSection__sectionBtn___1_cAs" data-automation-id="nav-section-toggle" tabindex="-1"> <!-- Initial category that contains sub-categories -->
  </button>
  <div>
  </div> <!-- Contains the links I need, but doesn't populate with HTML unless the sub-category element is expanded -->
</div>

Here's how it looks if the sub-category element has been expanded:

<div aria-label="Fruits &amp; Vegetables" data-automation-id="taxonomy-toggle-Fruits &amp; Vegetables">
  <button aria-disabled="true" aria-expanded="true" class="NavSection__sectionBtn___1_cAs" data-automation-id="nav-section-toggle" tabindex="-1"> <!-- Initial category that contains sub-categories -->
  </button>
  <div>
    <ul>
      <li class="NavSection__sectionLink__rbr40"> </li>
      <li class="NavSection__sectionLink__rbr40"> </li> <!-- can open each li element to acquire the href link -->
      <li class="NavSection__sectionLink__rbr40"> </li>
    </ul>
  </div>
</div>

And here's my code:

import sys

import bs4 as bs
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage

# Act as a client via Qt5 so that JavaScript-rendered elements end up in the HTML
class Page(QWebEnginePage):

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        # toHtml() is asynchronous; the result is delivered to the callback
        self.toHtml(self.callable)
        print("Load Finished")

    def callable(self, html_str):
        self.html = html_str
        self.app.quit()

page = Page("https://grocery.walmart.com")
soup = bs.BeautifulSoup(page.html, 'lxml')
print(soup.prettify())

I know that if the aria-expanded and aria-disabled attributes of the <button> element change from "false" to "true", the sub-category <li> elements will appear in the HTML; I confirmed this through manual inspection in Chrome.

My question is whether it is possible to acquire the href from those <li> elements. My assumption is that I'd have to edit the HTML to change the aria attributes from "false" to "true" after an initial parse and then re-parse the HTML with those changes. If not, is there any other method to get these elements from the webpage besides Selenium? I'm trying to keep the approach lean (no opening of browser windows, etc.).
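For reference, if the expanded markup did show up in my soup, I imagine pulling the links would look something like the sketch below (this is only my assumption that each <li> wraps an <a> tag carrying the href; the class name is taken from the snippet above):

import bs4 as bs

# Hypothetical: expanded_html would be the page HTML *after* the sub-category
# element has been expanded, which is exactly what I can't get right now.
expanded_html = page.html

soup = bs.BeautifulSoup(expanded_html, 'lxml')
links = []
for li in soup.find_all('li', class_='NavSection__sectionLink__rbr40'):
    a = li.find('a')  # assuming the href lives on an <a> inside the <li>
    if a and a.get('href'):
        links.append(a['href'])
print(links)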

I can provide the actual website URL and a screenshot of the webpage to help clarify; I'm not sure whether that's considered good practice or allowed on Stack Overflow (I'm new here!).

For more background information on the method I'm trying to use, see the following:

Sentdex's PyQt4 Dynamic Scraping Video

PyQt4 to PyQt5 library changes

1 Answer


If you download the HTML of the page you will see that almost the entire page is generated with JavaScript, so Beautiful Soup is not the right tool here since it only parses static HTML. In this case the solution is to implement the logic in JavaScript using the runJavaScript() method of QWebEnginePage:

from PyQt5 import QtCore, QtGui, QtWidgets, QtWebEngineWidgets


class WalmartGroceryPage(QtWebEngineWidgets.QWebEnginePage):
    def __init__(self, parent=None):
        super().__init__(parent)
        self._results = None
        self.loadFinished.connect(self._on_load_finished)
        self.setUrl(QtCore.QUrl("https://grocery.walmart.com"))

    @QtCore.pyqtSlot(bool)
    def _on_load_finished(self, ok):
        if ok:
            self.runJavaScript(
                """
                function scraper_script(){
                    var results = [];
                    // Open the side-bar navigation before expanding the category sections
                    document.getElementById("mobileNavigationBtn").click();
                    var elements = document.getElementsByClassName("NavSection__sectionBtn___1_cAs");
                    for (const element of elements) {
                        // Expand this category so its sub-category links are rendered
                        element.click();
                        var items = [];
                        var sub_elements = document.getElementsByClassName("MobileNavigation__navLink___2-m6_");
                        for (const e of sub_elements) {
                            var d = {"name": e.innerText, "url": e.href};
                            items.push(d);
                        }
                        var data = {"name": element.innerText, "items": items};
                        results.push(data);
                    }
                    return results;
                }
                scraper_script();
                """,
                self.results_callback,
            )

    def results_callback(self, value):
        # Store the scraped data and quit the event loop so app.exec_() returns
        self._results = value
        QtCore.QCoreApplication.quit()

    @property
    def results(self):
        return self._results


if __name__ == "__main__":
    import sys
    import json

    # sys.argv.append("--remote-debugging-port=8000")
    app = QtWidgets.QApplication(sys.argv)

    page = WalmartGroceryPage()
    ret = app.exec_()
    results = page.results

    print(json.dumps(results, indent=4))

Output:

[
    {
        "items": [
            {
                "name": "Fall Flavors Shop",
                "url": "https://grocery.walmart.com/cp/Flavors%20of%20Fall/9576778812"
            },
            {
                "name": "Baking Center",
                "url": "https://grocery.walmart.com/browse?shelfId=3433056320"
            },
            {
                "name": "Peak Season Produce",
                "url": "https://grocery.walmart.com/browse?shelfId=4881154845"
            },
# ...
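
If you then want a flat list of links rather than the nested structure, a small post-processing step over the returned value is enough (a minimal sketch, assuming results has the shape shown above):

# Flatten the nested category/sub-category structure into (category, name, url) rows.
rows = []
for category in results or []:
    for item in category.get("items", []):
        rows.append((category.get("name", ""), item["name"], item["url"]))

for category_name, name, url in rows:
    print(category_name, name, url, sep="\t")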
eyllanesc