
I'm trying to scrape the price of a product. Here's my code:

from bs4 import BeautifulSoup as soup
import requests

page_url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
uClient = requests.get(page_url, headers=headers)
print(uClient)
page_soup = soup(uClient.content, "html.parser") #requests
test = page_soup.find("p", {"class":"fb-price"})
print(test)

But I get the following response instead of the desired price:

<Response [200]>
None

I have checked that the element exists using Chrome developer tools. URL: https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/

ggorlen
Ricardo Mehr
  • Have you made sure that the `p` element you are looking for is not created by JavaScript modifying the DOM after the initial load? – kreld Dec 17 '19 at 16:00
  • Does this answer your question? [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python). If you use "view page source" the element doesn't exist. Use a headless browser. – ggorlen Dec 17 '19 at 16:01
  • Save the html response to a file and you will know what is happening. – Wonka Dec 17 '19 at 16:03
  • @Wonka no need to save it anywhere - just print it. – bruno desthuilliers Dec 17 '19 at 16:04
  • @kreld That seems to be the reason, thanks for your answer. – Ricardo Mehr Dec 17 '19 at 16:12
  • Take a look at local HTTP proxying with something like Fiddler to find the API routes that return the price value. Headless browsing is a good solution, but if you want to scrape a lot it is much heavier and slower than using direct HTTP requests. – Luiz Fernando Lobo Dec 17 '19 at 16:18
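
As the comments above suggest, a quick way to confirm that the price node is not in the raw HTML is to inspect the response body directly. A minimal sketch, reusing the question's URL and headers:

import requests

page_url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
response = requests.get(page_url, headers=headers)

# if the node is injected by JavaScript, the class name won't appear in the raw HTML
print("fb-price" in response.text)

# optionally save the response to a file for closer inspection, as suggested above
with open("page.html", "w", encoding="utf-8") as f:
    f.write(response.text)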

3 Answers


If you go to the Network tab you will find the following link, which returns the data in JSON format. You can get the price without Selenium or BeautifulSoup.

Url="https://www.falabella.com/rest/model/falabella/rest/browse/BrowseActor/fetch-item-details?{%22products%22:[{%22productId%22:%225311634%22},{%22productId%22:%225311597%22},{%22productId%22:%225311505%22},{%22productId%22:%226009874%22},{%22productId%22:%225311494%22},{%22productId%22:%225311510%22},{%22productId%22:%226009845%22},{%22productId%22:%226009871%22},{%22productId%22:%226009868%22},{%22productId%22:%226009774%22},{%22productId%22:%226782957%22},{%22productId%22:%226009783%22},{%22productId%22:%226782958%22},{%22productId%22:%228107608%22},{%22productId%22:%228107640%22},{%22productId%22:%226009875%22},{%22productId%22:%226782967%22},{%22productId%22:%226782922%22}]}"

Try the below code.

import requests

page_url = "https://www.falabella.com/rest/model/falabella/rest/browse/BrowseActor/fetch-item-details?{%22products%22:[{%22productId%22:%225311634%22},{%22productId%22:%225311597%22},{%22productId%22:%225311505%22},{%22productId%22:%226009874%22},{%22productId%22:%225311494%22},{%22productId%22:%225311510%22},{%22productId%22:%226009845%22},{%22productId%22:%226009871%22},{%22productId%22:%226009868%22},{%22productId%22:%226009774%22},{%22productId%22:%226782957%22},{%22productId%22:%226009783%22},{%22productId%22:%226782958%22},{%22productId%22:%228107608%22},{%22productId%22:%228107640%22},{%22productId%22:%226009875%22},{%22productId%22:%226782967%22},{%22productId%22:%226782922%22}]}"
headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
response = requests.get(page_url, headers=headers)
res = response.json()
# the first entry in 'products' is the iPhone 7 Plus; print each of its listed prices
for item in res['products'][0]['product']['prices']:
    print(item['symbol'] + item['originalPrice'])

Output:

$ 379.990
$ 569.990

To get the product name:

print(res['products'][0]['product']['displayName'])

Output:

Smartphone iPhone 7 PLUS 32GB

If you are only looking for the value $ 379.990, then print this:

print(res['products'][0]['product']['prices'][0]['symbol'] + res['products'][0]['product']['prices'][0]['originalPrice'])
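
For readability, the repeated indexing can be pulled into variables first; a small sketch of the same lookup:

product = res['products'][0]['product']
first_price = product['prices'][0]
print(product['displayName'])
print(first_price['symbol'] + first_price['originalPrice'])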
KunduK

The problem is that a JS script is inserting this HTML node dynamically after the page load. The request retrieves only the raw HTML and doesn't wait around for scripts to run.

You can use a headless browser such as Chrome WebDriver, which is able to wait for the page to load in real time and then access the DOM dynamically. Here's a sample of how you could use this after installing it:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
opts = Options()  
opts.add_argument("--headless")  
opts.add_argument("log-level=3") # suppress console noise
driver = webdriver.Chrome(options=opts)
driver.get(url)

print(driver.find_element_by_class_name("fb-price").text) # => $ 379.990

As pointed out in the other answer, another good option is to make the same API call to the URL that the script uses to access the data. There's nothing extra to install beyond requests with that approach, so it's very lightweight, and the API may be less brittle than the class name (or vice versa).
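
One caveat with the snippet above: it assumes the price node already exists when driver.get returns. If the script injects it a moment later, an explicit wait is safer. A minimal sketch using Selenium's WebDriverWait, assuming the same URL and class name:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)
try:
    driver.get(url)
    # wait up to 10 seconds for the dynamically inserted price node to appear
    price = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "fb-price"))
    )
    print(price.text)
finally:
    driver.quit()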

ggorlen

This is extremely hacky, and for real use cases, I would suggest using this: [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python)


By downloading the raw HTML via cURL and using grep (in your case, you could search the page source in the Sources tab of the developer tools), I was able to find that the price was stored in the fbra_browseMainProductConfig variable. Using BeautifulSoup, I was able to pull the script for it:

import requests, re
from bs4 import BeautifulSoup

page = requests.get("https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/")
soup = BeautifulSoup(page.content, "html.parser")
# grab the text where it has `fbra_browseMainProductConfig` in it, and strip the extra whitespace
script_contents = soup(text=re.compile("fbra_browseMainProductConfig"))[0].strip()

From there, I checked the output, and found that the first line was the fbra_browseMainProductConfig declaration. So:

import json
# split the contents of the script tag into lines, take the first element (0th index), strip any additional whitespace
mainProductConfigLine = script_contents.splitlines()[0].strip()
# split the value off the declaration, drop the trailing semicolon, and parse it as JSON
mainProductConfig = json.loads(mainProductConfigLine.split(" = ", 1)[1][:-1])
# grab the prices (plural, there is more than one)
# in order to find the key, I poked around the dict in a Python REPL
prices = [price["originalPrice"] for price in mainProductConfig["state"]["product"]["prices"] if "originalPrice" in price]
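
A quick sanity check on the result; the exact values depend on the live page, but if the structure matches the other answers it should be the same pair of prices:

print(prices)  # e.g. the original prices, such as the $ 379.990 value shown above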

Hope this helps!