I want the src URL of an image when I process some HTML, but I get back a base64-encoded image instead. What am I doing wrong?

Given a url like: "http://www.amazon.com/Cheese-Plate-multi-purpose-mounting-plate/dp/B00CI06DWE/"

And a desktop user agent:

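from io import StringIO

from lxml import etree
import requests

# url is the product URL above; agent is a dict holding a desktop User-Agent header
page = requests.get(url, headers=agent)
page_txt = page.text

# Parse the downloaded markup with lxml's lenient HTML parser
html_parser = etree.HTMLParser()
tree = etree.parse(StringIO(page_txt), html_parser)

path = '//img[@id="landingImage"]'

img = tree.xpath(path)

img_src = img[0].get('src')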

Using that code, I get back:

'\ndata:image/jpeg;base64,/9j/4AAQSkZJR'(truncated)

when I want:

http://ecx.images-amazon.com/images/I/41SNmVfXvhL.SY355.jpg

alecxe
dolphinkickme

1 Answer

The src attribute holds a base64-encoded image (a data URI), not a URL. You can get the actual URL from the data-a-dynamic-image attribute, which contains a JSON string whose keys are image URLs:

import json

path = '//img[@id="landingImage"]/@data-a-dynamic-image'
# The attribute value is a JSON object whose keys are image URLs; take the first key
print(next(iter(json.loads(tree.xpath(path)[0]))))

Prints:

http://ecx.images-amazon.com/images/I/41SNmVfXvhL._SX466_.jpg
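Since the JSON object can hold more than one key, it may help to collect every URL rather than only the first. The following is a minimal sketch, not a definitive implementation: the helper name `candidate_image_urls` and the sample attribute values (including the `example.com` URLs and the list values) are made up for illustration. It works on the raw attribute strings, so it runs without fetching the page.

```python
import json


def candidate_image_urls(src, dynamic_image):
    """Return image URLs for an <img>, given its raw attribute values.

    src           -- value of the src attribute (may be a data: URI or None)
    dynamic_image -- value of data-a-dynamic-image (a JSON object or None)
    """
    if dynamic_image:
        # Keys of the JSON object are the image URLs; the values are
        # ignored here because this sketch makes no assumption about them.
        return list(json.loads(dynamic_image))
    src = (src or "").strip()
    # An inline data: URI carries the image bytes, not a fetchable URL.
    if not src or src.startswith("data:"):
        return []
    return [src]


# Hypothetical attribute values, shaped like the ones in the question.
sample_src = "\ndata:image/jpeg;base64,/9j/4AAQSkZJR"
sample_dynamic = (
    '{"http://example.com/a._SX466_.jpg": [466, 466], '
    '"http://example.com/a._SY355_.jpg": [355, 355]}'
)

urls = candidate_image_urls(sample_src, sample_dynamic)
print(urls)
```

With the tree from the question, the call would look like `candidate_image_urls(img[0].get('src'), img[0].get('data-a-dynamic-image'))`. Note that on Python 2 a dict has no guaranteed key order, so taking only the first key may return any one of the available URLs.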
Community
alecxe