lxml etree and xpath returning an encoded image rather than URL for src

Question

I want the src URL of an image when I process some HTML, but I am getting back a base64-encoded image instead. What am I doing wrong?

Given a url like: "http://www.amazon.com/Cheese-Plate-multi-purpose-mounting-plate/dp/B00CI06DWE/"

And a desktop user agent:

from io import StringIO

from lxml import etree
import requests

# url is the product URL above; agent is a headers dict with a desktop User-Agent
page = requests.get(url, headers=agent)
page_txt = page.text

html_parser = etree.HTMLParser()
tree = etree.parse(StringIO(page_txt), html_parser)

path = '//img[@id="landingImage"]'

img = tree.xpath(path)

img_src = img[0].get('src')

Using that code, I'm getting back:

'\ndata:image/jpeg;base64,/9j/4AAQSkZJR'(truncated)

when I want:

http://ecx.images-amazon.com/images/I/41SNmVfXvhL.SY355.jpg

score 2 · Accepted Answer · edited May 23 '17 at 12:21


The src attribute holds a base64-encoded image (a data: URI), not a URL. You can get the actual URL from the data-a-dynamic-image attribute, which contains a JSON string whose keys are image URLs:

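import json

# data-a-dynamic-image holds a JSON object whose keys are image URLs
path = '//img[@id="landingImage"]/@data-a-dynamic-image'
images = json.loads(tree.xpath(path)[0])
print(next(iter(images)))  # first URL in the object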

Prints:

http://ecx.images-amazon.com/images/I/41SNmVfXvhL._SX466_.jpg
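If the same scraper has to handle images whose src is sometimes a plain URL and sometimes an inline data: URI, the check and the fallback can be wrapped in one helper. This is only a sketch: resolve_image_url is a hypothetical name, the sample values are made up, and it assumes the JSON object maps each URL to a [width, height] list, which the answer above does not confirm.

```python
import json


def resolve_image_url(src, dynamic_image_json=None):
    """Return a fetchable URL for an <img>, or None if none is found.

    src: raw value of the src attribute (may be an inline data: URI).
    dynamic_image_json: raw value of data-a-dynamic-image, if present.
    """
    src = (src or "").strip()
    # A regular URL in src can be used directly.
    if src and not src.startswith("data:"):
        return src
    if not dynamic_image_json:
        return None
    # Assumption: the attribute decodes to a JSON object keyed by URL,
    # with [width, height] lists as values.
    candidates = json.loads(dynamic_image_json)
    if not candidates:
        return None
    # Prefer the largest image by pixel area.
    return max(candidates, key=lambda u: candidates[u][0] * candidates[u][1])


# Made-up sample values, for illustration only.
sample_src = "\ndata:image/jpeg;base64,/9j/4AAQSkZJR"
sample_json = json.dumps({
    "http://example.com/small.jpg": [300, 300],
    "http://example.com/large.jpg": [466, 466],
})
print(resolve_image_url(sample_src, sample_json))
```

With the tree from the question, the call would look like resolve_image_url(img[0].get('src'), img[0].get('data-a-dynamic-image')).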


answered Sep 22 '14 at 01:54

alecxe

