I'm using XPath to scrape a particular Amazon web page, but it doesn't work. Can anyone give me some advice? Here's the link to that page: a link

I want to scrape the feature text, for example "Fun, credit card-sized prints". The code I'm using is here:

from lxml import html
import requests

url = 'http://www.amazon.co.uk/dp/B009CX5VN2'
page = requests.get(url)
tree = html.fromstring(page.text)
feature_bullets = tree.xpath('//*[@id="feature-bullets"]/ul/li[1]/span/text()')

But feature_bullets is always empty. I really need some help.

nyedidikeke
user2372074
  • I draw your attention to the second paragraph of section 3 of [Amazon UK's Conditions of Use & Sale](http://www.amazon.co.uk/gp/help/customer/display.html/ref=footer_cou/279-7931089-4191464?ie=UTF8&nodeId=1040616). – Robᵩ Jul 31 '14 at 16:37

1 Answer

The HTML that I download doesn't match your expectations. Here is the expression that works for me:

tree.xpath('//div[@id="technicalProductFeaturesATF"]/ul/li[1]/text()')

Complete program:

from lxml import html
import requests
from pprint import pprint

url = 'http://www.amazon.co.uk/dp/B009CX5VN2'
page = requests.get(url)
tree = html.fromstring(page.text)
feature_bullets = tree.xpath('//div[@id="technicalProductFeaturesATF"]/ul/li/text()')

pprint(feature_bullets)

Result:

$ python foo.py 
['Fun, credit card-sized prints',
 'LCD film counter and shooting mode display',
 'Camera mounted mirror for self portraits',
 'Powered by CR2 Batteries, Built-in, Automatic electronic flash',
 'Fujifilm Instax Mini 25 + 30 Instax Mini Film']
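
Since the markup served to a script can differ from what a browser shows, it may help to try several candidate XPath expressions and keep the first one that returns anything. The following is only a minimal sketch: `SAMPLE_HTML` is a made-up stand-in for `page.text`, and `first_matching` and `CANDIDATES` are hypothetical names that do not come from the original code.

```python
from lxml import html

# Hypothetical stand-in for page.text; the real Amazon markup may differ.
SAMPLE_HTML = """
<html><body>
  <div id="technicalProductFeaturesATF">
    <ul>
      <li>Fun, credit card-sized prints</li>
      <li>LCD film counter and shooting mode display</li>
    </ul>
  </div>
</body></html>
"""

# Candidate expressions: the one from the question, then the one from this answer.
CANDIDATES = [
    '//*[@id="feature-bullets"]/ul/li/span/text()',
    '//div[@id="technicalProductFeaturesATF"]/ul/li/text()',
]


def first_matching(tree, expressions):
    """Return (expression, results) for the first XPath that yields text.

    Returns (None, []) when no expression matches anything.
    """
    for expr in expressions:
        # Drop whitespace-only text nodes and trim the rest.
        results = [text.strip() for text in tree.xpath(expr) if text.strip()]
        if results:
            return expr, results
    return None, []


tree = html.fromstring(SAMPLE_HTML)
expr, bullets = first_matching(tree, CANDIDATES)
print(expr)
print(bullets)
```

With the sample markup above, the first expression finds nothing and the second one returns the two list items. Against a live page, the `(None, [])` result is the signal to inspect the downloaded HTML itself.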
Robᵩ
  • Thanks for your answer. I noticed that the HTML the program reads is slightly different from the HTML in Chrome or any other browser, although I don't know why that's the case. Maybe each browser has its own standard format. – user2372074 Jul 31 '14 at 18:37
  • You might consider [sending a user-agent string](http://stackoverflow.com/questions/10606133/how-to-send-user-agent-in-requests-library-in-python) along with your request. If you use the [user-agent from your browser](http://www.whatsmyuseragent.com/), you'll probably get the same page. – Robᵩ Jul 31 '14 at 18:47
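
The user-agent suggestion in the comment above could be sketched roughly as follows. This is an illustration, not a tested recipe for Amazon: the `BROWSER_USER_AGENT` value is an example string (substitute the one your own browser reports), and `browser_headers` and `fetch_page` are hypothetical helper names. Whether the server then returns the same page a browser sees is not guaranteed.

```python
import requests

# Example browser User-Agent string (an assumption; replace it with your browser's).
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
)


def browser_headers(user_agent=BROWSER_USER_AGENT):
    """Build request headers that present the given User-Agent."""
    return {"User-Agent": user_agent}


def fetch_page(url, user_agent=BROWSER_USER_AGENT, timeout=10):
    """Download url with the given User-Agent and return the response text."""
    response = requests.get(url, headers=browser_headers(user_agent), timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# page_text = fetch_page('http://www.amazon.co.uk/dp/B009CX5VN2')
# tree = html.fromstring(page_text)  # with `from lxml import html`
```

The returned text can be passed to `html.fromstring` exactly as `page.text` is in the programs above.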