I'm trying to write a Python script that parses one element from a website and simply prints it.

I couldn't figure out how to do this without Selenium's webdriver, which opens a browser that runs the scripts needed to display the website properly.

from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()

This is just a rough draft which will print the contents, including the element of interest <span class="prod-price-inner">£13.00</span>.

How could I get the element of interest without the browser opening, or even without a browser at all?

Edit: I've previously tried urllib in Python and wget in bash, both of which lack the required JavaScript interpretation.

boolean.is.null
  • I'm planning to create a small Python script. – boolean.is.null Oct 13 '15 at 00:30
  • Ok, I'm working on it :) I'll post my answer in a bit. Just to make sure I got it right, You need the price element, right ? – Pedro Lobito Oct 13 '15 at 00:32
  • You want to hide the browser? Duplicate of http://stackoverflow.com/questions/5370762/how-to-hide-firefox-window-firefox-webdriver/23898148#23898148 – RobertB Oct 13 '15 at 00:36
  • In the meantime, you can take a look at http://www.crummy.com/software/BeautifulSoup/bs4/doc/. To install it, use `pip install BeautifulSoup4` – Pedro Lobito Oct 13 '15 at 00:51
  • You can only parse that page with a browser. The page doesn't display anything if javascript isn't enabled. Selenium is the way to go. – Pedro Lobito Oct 13 '15 at 00:57

2 Answers

As other answers mentioned, this webpage requires JavaScript to render its content, so you can't simply fetch and process the page with lxml, Beautiful Soup, or a similar library. But there's a much simpler way to get the information you want.

I noticed that the link you provided fetches its data from an internal API in a structured fashion. Based on the URL, the product number appears to be 910000800509. If you look at the Network tab in Chrome dev tools (or your browser's equivalent dev tools), you'll see that a GET request is being made to the following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.

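You can make the request like this with just the requests module (its `r.json()` method decodes the JSON response, so a separate `import json` isn't needed):

import requests

# Internal API endpoint seen in the browser's Network tab
url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)

# The response is JSON; the price is on the first entry of 'items'
price = r.json()['items'][0]['price']

print(price)
# Output: £13.00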

This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
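As a rough sketch of how this could be wrapped up for reuse, the snippet below separates fetching from parsing and falls back to `wasPrice` when an item reports `£0.00`, which is the behaviour described in the comments below for unavailable products. The function names `extract_price` and `fetch_price` are hypothetical, and the standard library's `urllib.request` is swapped in for `requests` only to avoid a third-party dependency. It assumes the response keeps the `items`, `price`, and `wasPrice` fields mentioned in this answer.

```python
import json
from urllib.request import urlopen

# URL pattern taken from the answer above; item_id is the product number.
API_URL = "http://groceries.asda.com/api/items/view?itemid={item_id}"


def extract_price(payload):
    """Return the price of the first item in an already-decoded API response."""
    item = payload["items"][0]
    price = item.get("price")
    # Assumption from the comments: unavailable products report "£0.00",
    # and the previous price is then available under "wasPrice".
    if price in (None, "£0.00"):
        return item.get("wasPrice", price)
    return price


def fetch_price(item_id, timeout=10):
    """Request the item JSON over the network and return its price."""
    with urlopen(API_URL.format(item_id=item_id), timeout=timeout) as response:
        payload = json.load(response)
    return extract_price(payload)
```

Calling `fetch_price("910000800509")` would perform the same GET request as above. Keeping `extract_price` separate means the parsing can be checked against a saved response without touching the network.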

evan_schmevan
  • Elegant, simplistic and functional! – boolean.is.null Oct 13 '15 at 10:26
  • I'd like to know how you found the GET request. Also, the product number `910000456105` doesn't seem to work: I always get the price `£0.00` for the product at `http://groceries.asda.com/product/canned-lagers/tennents-lager/910000456105`. Other than that, perfect! – boolean.is.null Oct 13 '15 at 11:00
  • I'd edit my comment but it's too late. It seems like the product is currently not available, thus the price. – boolean.is.null Oct 13 '15 at 13:26
  • I just checked this other product `910000456105`. It looks like when the product is unavailable, the price shows as £0.00. You can see the previous price at the `wasPrice` attribute with `price = r.json()['items'][0]['wasPrice']`. – evan_schmevan Oct 13 '15 at 17:17
  • I found the GET request by visiting the URL you provided and opening dev tools. Click on the Network tab and refresh the page, and you can see all the requests being made. This might be helpful to you: [http://discover-devtools.codeschool.com/](http://discover-devtools.codeschool.com/) – evan_schmevan Oct 13 '15 at 17:21

How could I get the element of interest without the browser opening, or even without a browser at all?

After inspecting the page you're trying to parse:

http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509

I realized that it only displays the content if JavaScript is enabled. Because of that, you need to use a real browser.

Conclusion:

The way to go, if you need to automate this, is:

selenium
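If Selenium is kept, the price can at least be pulled out of `browser.page_source` by element rather than by slicing character offsets as in the question. The following is a minimal sketch using only the standard library's `html.parser`. The class `PriceSpanParser` and the function `find_prices` are hypothetical names, and the `prod-price-inner` class comes from the question. It does not handle spans nested inside the price span.

```python
from html.parser import HTMLParser


class PriceSpanParser(HTMLParser):
    """Collect the text of every <span> whose class list includes target_class."""

    def __init__(self, target_class="prod-price-inner"):
        super().__init__()
        self.target_class = target_class
        self.prices = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and self.target_class in classes:
            self._capturing = True
            self.prices.append("")

    def handle_endtag(self, tag):
        # Stop at the first closing </span>; nested spans are not tracked.
        if tag == "span":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.prices[-1] += data


def find_prices(page_source):
    """Return the stripped text of all matching price spans in the HTML string."""
    parser = PriceSpanParser()
    parser.feed(page_source)
    parser.close()
    return [price.strip() for price in parser.prices]
```

With the script from the question, `find_prices(browser.page_source)` would take the place of `print(content[42000:43000])`, provided the page has finished rendering by the time the source is read.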

Pedro Lobito