
I'm trying to scrape data from this website, where some products have three kinds of prices (a muted price, a red price and a black price). I observed that the red price changes before the page finishes loading when a product has all three prices.

When I scrape the website I only get two of the prices. I think that if the code waited until the page fully loads, I would get all three.

Here is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.exito.com/televisor-led-samsung-55-pulgadas-uhd-4k-smart-tv-serie-7-24449/p'
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")

# Muted price ("precio tachado" - the struck-through list price)
MutedPrice = soup.find_all("span", {'class': 'exito-vtex-components-2-x-listPriceValue ph2 dib strike custom-list-price fw5 exito-vtex-component-precio-tachado'})[0].text
MutedPrice = pd.to_numeric(MutedPrice[2 - len(MutedPrice):].replace('.', ''))  # drop the leading "$ " and the thousands dots

# Red price ("precio rojo" - the selling price)
RedPrice = soup.find_all("span", {'class': 'exito-vtex-components-2-x-sellingPrice fw1 f3 custom-selling-price dib ph2 exito-vtex-component-precio-rojo'})[0].text
RedPrice = pd.to_numeric(RedPrice[2 - len(RedPrice):].replace('.', ''))

# Black price ("precio negro" - the allied price)
BlackPrice = soup.find_all("span", {'class': 'exito-vtex-components-2-x-alliedPrice fw1 f3 custom-selling-price dib ph2 exito-vtex-component-precio-negro'})[0].text
BlackPrice = pd.to_numeric(BlackPrice[2 - len(BlackPrice):].replace('.', ''))

print('Muted Price:', MutedPrice)
print('Red Price:', RedPrice)
print('Black Price:', BlackPrice)

Actual Results: Muted Price: 3199900 Red Price: 1649868 Black Price: 0

Expected Results: Muted Price: 3199900 Red Price: 1550032 Black Price: 1649868

2 Answers

It might be that those values are rendered dynamically, i.e. populated by JavaScript after the page loads.

requests.get() simply returns the markup received from the server, without any client-side changes, so it is not really about waiting: that JavaScript never runs.

You could use the Selenium Chrome WebDriver to load the page URL and get the rendered page source (the Firefox driver works too).

Go to chrome://settings/help to check your current Chrome version, then download the matching driver version from here. Make sure the driver file is either on your PATH or in the same folder as your Python script.

Try replacing the top 3 lines of your existing code with this:

from contextlib import closing
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome  # pip install selenium

url = 'https://www.exito.com/televisor-led-samsung-55-pulgadas-uhd-4k-smart-tv-serie-7-24449/p'

# Use Chrome to load the page so the JavaScript-generated content is rendered
with closing(Chrome(executable_path="./chromedriver")) as browser:
    browser.get(url)
    page_source = browser.page_source

soup = BeautifulSoup(page_source, "lxml")

Outputs:

Muted Price: 3199900
Red Price: 1550032
Black Price: 1649868

References:

Get page generated with Javascript in Python

selenium - chromedriver executable needs to be in PATH

Rithin Chalumuri
  • Thanks for your quick response, the code works perfectly, I just have to see if it works on around 11,000 pages :) – Fabio Salinas Nov 03 '19 at 02:09
  • @FabioSalinas, you would probably want the browser to run in the background without opening a window. At the moment, the code opens a new browser window. You can refer to this to make it fully headless: https://sqa.stackexchange.com/a/34401 :) – Rithin Chalumuri Nov 03 '19 at 02:13
  • Thanks again for your help. I'm now running the code for around 13,000 pages, and performance is better when running without opening the browser and without loading images, but even so I would have to wait around 60 hours to scrape all the data, so I'm thinking about how to optimize my code so it can run every day. – Fabio Salinas Nov 11 '19 at 14:34

The page you are trying to scrape contains JavaScript code, which is executed by your browser and modifies the page after it is downloaded. If you want to perform extractions on the "final state" of the page, you need to run the JavaScript code on the page using a library dedicated to that. Unfortunately, BeautifulSoup does not have this functionality, and you will need to use another library to achieve your task.

For example, you can pip install requests-html and run the following:

#!/usr/bin/env python3

import re
from requests_html import HTMLSession

def parse_price_text(price_text):
    """Extract just the price digits and dots from the <span> tag text"""
    matches = re.search(r"([\d\.]+)", price_text)
    if not matches:
        raise RuntimeError(f"Could not parse price text: {price_text}")

    return matches.group(1)

# Starting a session and running the JavaScript code with render()
# to make sure the DOM is the same as when using the browser.
session = HTMLSession()
exito_url = "https://www.exito.com/televisor-led-samsung-55-pulgadas-uhd-4k-smart-tv-serie-7-24449/p"
response = session.get(exito_url)
response.html.render()

# Define all price types and their associated CSS class
price_types = {
    "listPrice": "exito-vtex-components-2-x-listPriceValue",
    "sellingPrice": "exito-vtex-components-2-x-sellingPrice",
    "alliedPrice": "exito-vtex-components-2-x-alliedPrice"
}

# Iterate over price types and extract them from the page
for price_type, price_css_class in price_types.items():
    price = parse_price_text(response.html.find(f"span.{price_css_class}", first=True).text)
    print(f"{price_type} price: {price} $")

It prints the following:

listPrice price: 3.199.900 $
sellingPrice price: 1.550.032 $
alliedPrice price: 1.649.868 $
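Note that these prices are still strings with dots as thousands separators. If you need numeric values like in the question, a small helper (hypothetical, not part of the answer's code) can finish the job:

```python
def price_to_int(price_text):
    """Turn a dot-separated price string such as '3.199.900' into the integer 3199900."""
    return int(price_text.replace(".", ""))

print(price_to_int("3.199.900"))  # 3199900
```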
Pierre
  • Thanks for your quick response! When I run the code it shows this: RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead. – Fabio Salinas Nov 03 '19 at 02:07
  • Make sure to run this code from the terminal with the Python interpreter directly: `python exito.py`, `exito.py` being the code I gave you. I suspect you are using some additional software that has an event loop (presumably Jupyter?) and conflicts with `requests-html`. – Pierre Nov 03 '19 at 12:23
  • Thank you, that's right, I'm running the code in a Jupyter notebook; I will try it as you said in a .py file. Thanks again for your help. – Fabio Salinas Nov 05 '19 at 23:29