
I'm trying to scrape this page (for example): https://super.walmart.com.mx/papel-higienico/papel-higienico-petalo-rendimax-12-rollos-con-320-hojas-dobles/00750194345845

... to get the product price. With requests-html I can get dynamic content from other pages, but it is not working on Walmart. I know I can do this with Selenium, but I'm trying to understand why it is not working with requests-html and, if possible, how I can do it.

This is my current code:

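import requests_html as rh

url = "https://super.walmart.com.mx/papel-higienico/papel-higienico-petalo-rendimax-12-rollos-con-320-hojas-dobles/00750194345845"

session = rh.HTMLSession()
r = session.get(url)

# Run the page's JavaScript in headless Chromium, then query the rendered DOM.
r.html.render()
price_box = r.html.find('.main-content_rightContainer__3_cSi', first=True)
# find(..., first=True) returns None when nothing matches the selector.
print(price_box.text if price_box is not None else "element not found")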
  • The contents of that URL are loaded by JavaScript after the initial HTML (which is empty, except for the JavaScript code) has been loaded. If you look at the actual source code of the HTML you are getting (e.g. for Chrome and Firefox `view-source:https://super.walmart.com.mx/papel-higienico/papel-higienico-petalo-rendimax-12-rollos-con-320-hojas-dobles/00750194345845`), you will see there are only `script` tags and one `div` with id `root` where the contents are later loaded. You would need to process the JavaScript from Python in order to get the contents you want. – jdehesa Nov 28 '19 at 18:26
  • Note that web scraping is also not allowed by the terms of that website (see section 2 of [terms and conditions](https://super.walmart.com.mx/contenido/terminos-y-condiciones), "(...) entendiendo como uso indebido (...) La utilización de mecanismos o herramientas automatizadas o tecnología similar cuya finalidad sea realizar la extracción, obtención o recopilación, directa o indirecta, de cualquier información contenida en el sitio"). – jdehesa Nov 28 '19 at 18:28
  • @jdehesa the `.render` method is exactly for loading JS content, as you can read in the so-called duplicate question. Yet the content is not being rendered, and I'd like to know why. As for the terms and conditions, I hadn't read them. However, their API is not accepting any new members. – Ricardo Fernandes Campos Nov 28 '19 at 18:40
  • I see, sorry, I didn't realize you could do it with `requests_html` alone. Have you tried fiddling with the parameters to [`render`](https://requests-html.kennethreitz.org/#requests_html.HTML.render)? Maybe increasing the `wait`, since that page takes a while to load. You could also try PhantomJS or some of the other options in the other question, at least to see if they work. – jdehesa Nov 28 '19 at 18:49
  • I just tried increasing wait, adding sleep and scrolling down. Didn't work. =( – Ricardo Fernandes Campos Nov 28 '19 at 19:00
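
One way to narrow this down is to check whether the HTML you hold is still the empty JavaScript shell described in the first comment (only `script` tags plus a `div` with id `root`), both before and after calling `render()`. Below is a minimal sketch using only the standard library. The names `ShellDetector` and `looks_like_js_shell` and the 50-character threshold are made up for illustration; they are not part of requests-html.

```python
from html.parser import HTMLParser


class ShellDetector(HTMLParser):
    """Collects visible text and notes whether a <div id="root"> exists."""

    # Tags whose text content is never displayed in the page body.
    HIDDEN_TAGS = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0      # > 0 while inside a hidden tag
        self.visible_text = []     # non-empty text chunks outside hidden tags
        self.has_root_div = False  # True once <div id="root"> is seen

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self.hidden_depth += 1
        if tag == "div" and dict(attrs).get("id") == "root":
            self.has_root_div = True

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.hidden_depth == 0 and text:
            self.visible_text.append(text)


def looks_like_js_shell(html_text, max_visible_chars=50):
    """Return True if the HTML has a root div but almost no visible text.

    max_visible_chars is an arbitrary threshold chosen for this sketch.
    """
    parser = ShellDetector()
    parser.feed(html_text)
    parser.close()
    visible = " ".join(parser.visible_text)
    return parser.has_root_div and len(visible) <= max_visible_chars
```

With the code from the question, you could call `looks_like_js_shell(r.html.html)` once right after `session.get(...)` and again after `r.html.render()`. If it is still `True` after rendering, the scripts did not populate the `root` element in the headless browser, which is a different problem from the selector not matching. This only tells you where the failure is, not why it happens.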

0 Answers