
As the title suggests: calling requests.get() gives me a different image src link from the one I see when browsing the site manually.

I'm trying to scrape a site for products and want to store the images but the src I get from the site is for a very low quality image that's blurry. I compared the src to the one on the site and it's different. Not sure if I need to pass it something to "force" screen size in the request?

My code below:

from requests import get
from lxml import html

def demo():
    params = {'page': 0}
    response = get('https://www.checkers.co.za/c-2256/All-Departments', params=params)
    tree = html.fromstring(response.content)
    images = tree.xpath('//a[@class="product-listening-click"]/img[@src]')
    images = ['https://www.checkers.co.za' + image.attrib['src'] for image in images]
    print(images)

The src of the first link in the list differs from the src of the original image on the site:

site src:
https://www.checkers.co.za/medias/10136669EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wzNTA4MHxpbWFnZS9wbmd8aW1hZ2VzL2g1My9oZmYvODg1NzQ3ODYyNzM1OC5wbmd8YTM4YjE3YmMxYzJjMzI4MmIzMTQ0ZWU1MjlkYjBmNWZjZGFhYzYxYzAyZGMyNDhlNDE0MDhjYWQ0MjQxNmQ3NA

retrieved src:
https://www.checkers.co.za/medias/lqi-10136669EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wxNDkwfGltYWdlL3BuZ3xpbWFnZXMvaDY1L2gwNC85MDgxNzU4NDgyNDYyLnBuZ3w0MmY3ZmMzNzJmYTU0MGIzNDk0ZjdmOTkyODYwMGI3N2I5YWJhZDRkOTljNzViYjIxMWQ3OWU2NDVjZGZhZTdm
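The two links differ in more than the lqi- prefix: the context parameter is unpadded base64, and decoding it gives a quick way to compare them. Below is a minimal sketch using only the standard library; the helper name decode_context is made up for illustration, and reading the third field as a file size in bytes is my assumption, not something the site documents.

```python
import base64
from urllib.parse import urlparse, parse_qs

SITE_SRC = "https://www.checkers.co.za/medias/10136669EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wzNTA4MHxpbWFnZS9wbmd8aW1hZ2VzL2g1My9oZmYvODg1NzQ3ODYyNzM1OC5wbmd8YTM4YjE3YmMxYzJjMzI4MmIzMTQ0ZWU1MjlkYjBmNWZjZGFhYzYxYzAyZGMyNDhlNDE0MDhjYWQ0MjQxNmQ3NA"
LQI_SRC = "https://www.checkers.co.za/medias/lqi-10136669EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wxNDkwfGltYWdlL3BuZ3xpbWFnZXMvaDY1L2gwNC85MDgxNzU4NDgyNDYyLnBuZ3w0MmY3ZmMzNzJmYTU0MGIzNDk0ZjdmOTkyODYwMGI3N2I5YWJhZDRkOTljNzViYjIxMWQ3OWU2NDVjZGZhZTdm"


def decode_context(url):
    """Return the '|'-separated fields packed into the base64 `context` parameter.

    Hypothetical helper for illustration; the field meanings are guesses.
    """
    token = parse_qs(urlparse(url).query)["context"][0]
    # The token has no '=' padding, so restore it before decoding.
    padded = token + "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8").split("|")


site_fields = decode_context(SITE_SRC)
lqi_fields = decode_context(LQI_SRC)

# Third field: 35080 for the site image, 1490 for the lqi- one
# (assumed, not confirmed, to be the file size in bytes).
print(site_fields[:4])
print(lqi_fields[:4])
```

If that reading is right, the lqi- link points to a PNG of roughly 1.5 KB versus roughly 35 KB for the image shown on the site, which would fit the blurry result.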

EDIT 1:

I tried adding a User-agent header using the fake-useragent package and looped through all possible ones. src results did not change.

EDIT 2:

It seems that parsing the page with lxml.html as opposed to bs4 gives different output for the image's data-original-src. I'm not sure why, but thank you @AmineBTG for helping me notice this issue.

Note:
Accessing data-original-src with lxml.html via //a[@class="product-listening-click"]/img/@data-original-src instead of //a[@class="product-listening-click"]/img[@data-original-src] did the trick: the first expression returns the attribute values themselves, while the second only selects the img elements that carry the attribute. Judging by my tests, the headers were not required.
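To make the difference between the two XPath expressions concrete, here is a small sketch that runs against an inline HTML snippet rather than the live site. The markup and image paths are invented for illustration; only the class name and attribute names come from the page described above.

```python
from lxml import html

# Invented markup that mimics one product tile: `src` is the lazy-load
# placeholder, `data-original-src` is the full-size image path.
SAMPLE = """
<div>
  <a class="product-listening-click" href="/p/1">
    <img src="/medias/lqi-demo-300Wx300H" data-original-src="/medias/demo-300Wx300H">
  </a>
</div>
"""

tree = html.fromstring(SAMPLE)

# A predicate ([@attr]) only filters: the result is still a list of <img> elements.
elements = tree.xpath('//a[@class="product-listening-click"]/img[@data-original-src]')
# Reading .attrib['src'] from those elements still yields the placeholder.
from_src = [img.attrib["src"] for img in elements]

# An attribute step (/@attr) returns the attribute values as strings.
values = tree.xpath('//a[@class="product-listening-click"]/img/@data-original-src')

print(from_src)
print(values)
```

So the predicate form was never wrong about which elements it matched; the low-quality link came from still reading src off those elements.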

Marco Fernandes

2 Answers


It works when you pass appropriate headers and read the "data-original-src" attribute rather than the "src" attribute. See the code below (slightly modified):

import requests
from bs4 import BeautifulSoup

def demo():
    header = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-language": "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7",
        "cache-control": "max-age=0",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \" Not;A Brand\";v=\"99\", \"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    }

    params = {'page': 0}

    r = requests.get('https://www.checkers.co.za/c-2256/All-Departments', headers=header, params=params)
    s = BeautifulSoup(r.content, "html.parser")

    # Each product tile wraps its <img> in a div with this class.
    products = s.find_all("div", {"class": "item-product__image"})
    # "src" holds the low-quality placeholder; "data-original-src" holds the full-size image path.
    images = ['https://www.checkers.co.za' + prod.find("img").attrs.get("data-original-src") for prod in products]

    return images

print(demo())

Output:

['https://www.checkers.co.za/medias/10136669EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wzNTA4MHxpbWFnZS9wbmd8aW1hZ2VzL2g1My9oZmYvODg1NzQ3ODYyNzM1OC5wbmd8YTM4YjE3YmMxYzJjMzI4MmIzMTQ0ZWU1MjlkYjBmNWZjZGFhYzYxYzAyZGMyNDhlNDE0MDhjYWQ0MjQxNmQ3NA', 'https://www.checkers.co.za/medias/10136301EAV2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w1ODA4OHxpbWFnZS9wbmd8aW1hZ2VzL2g2ZS9oMGIvOTA5NjUxNTA5MjUxMC5wbmd8ZTViYzUzY2FiOWIyNWNmYmY0OGQ0ZGY0ZmY2ZDQwMGI3Nzk4ODMwOGYzMWRhNjIxOGZmM2Y1ZTExNDgxZWZjMg', 'https://www.checkers.co.za/medias/10151456EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w4NDE0NHxpbWFnZS9wbmd8aW1hZ2VzL2gzNC9oMjIvODg1NzgxMjU5ODgxNC5wbmd8OWMxYjE4Nzc0MjNkZTU2ZGI5ZDZmN2Q2M2FhMTdhZmM3Yjc4NDgwMzEwMjg1NmNiZTM1YWNjZjkxOTUyMzhmNQ', 'https://www.checkers.co.za/medias/10151458EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w1MTc0MnxpbWFnZS9wbmd8aW1hZ2VzL2g2MS9oYmEvODg1Nzg5Nzk5MjIyMi5wbmd8ZDFhNTlkMmJiNGY3ZDA3YjU5NjkwZGMxMjY3ZTgzMDVjYzFkMDkxNzI4NzlmN2U2MjhjZGJmYjE1NDg5ZDU2ZA', 'https://www.checkers.co.za/medias/10143000EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w0NjYyN3xpbWFnZS9wbmd8aW1hZ2VzL2g2MC9oZDIvODg1NzY3NDA4ODQ3OC5wbmd8ZDAzYzk5ZGFkY2Y1NjBiOTllZGJjOTVkZTUwNTg3NjBhYTM4NTk0OGFiYzk4OWRlNDQxZTdkNjQzNWM5YmU1NA', 'https://www.checkers.co.za/medias/10136298EAV2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w1NzYzNXxpbWFnZS9wbmd8aW1hZ2VzL2g3MC9oZWEvOTA5NTI5NTY2NDE1OC5wbmd8NDVkZmY4YWU4NWY3MDliYjYyNTk4MWM1NzIyOTNlMjYwMzEwOGFiNGNiZTEzNGRhMmVkZjNiZTU0ZjNiYThiMw', 'https://www.checkers.co.za/medias/10151462EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2MzY1MHxpbWFnZS9wbmd8aW1hZ2VzL2g4Zi9oMjcvODg1NzgxNDU2NDg5NC5wbmd8ZTE5ZmViZTdjNDNkOTVjMjZkODYwMjA4YTczNTgwNjU5ZmViMmE4OTQ4YzUwYjgwMjI0ZGJkNzJkNTI5OGU0Mg', 
'https://www.checkers.co.za/medias/10241929EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2MjMwOXxpbWFnZS9wbmd8aW1hZ2VzL2g0NS9oY2QvODg1ODQ2ODM4NDc5OC5wbmd8YzE1MGZkOWI2MjAyOWVlNzQ2YmRkMWM2MDNhZTk3ZGFkYWY4ZWMxNzA5Njc5OTMxNzY3OWEyNzg5MzczZmM1ZA', 'https://www.checkers.co.za/medias/10165121EAV2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2NTUzMXxpbWFnZS9wbmd8aW1hZ2VzL2gyZC9oYjcvODg2NDkyODM2NjYyMi5wbmd8NzEyODllNDlmZjE1NjJmMzEyMmU4MTU4NWQ4ZjRjYmM1Nzc2NWNjM2Y2YmFmZGQ1N2Y5ZjFmOTY0ZjBkMGE1OQ', 'https://www.checkers.co.za/medias/10145817EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wyMzU0MHxpbWFnZS9wbmd8aW1hZ2VzL2gwMy9oZDIvODg1Nzc0ODc5OTUxOC5wbmd8NGQxZjA0OWNkZTVjY2JmNTI2ZTdlOGIwZjFiMmE5MGFhYjQ2NjZhMDBiYWMyNzVhYjMxNTI4YTZjYWU3OWZhZA', 'https://www.checkers.co.za/medias/10151065EAv2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2MDkyNHxpbWFnZS9wbmd8aW1hZ2VzL2g2Ny9oMDEvOTI3ODc2MDE4OTk4Mi5wbmd8YWQ1N2E4Y2ZmNTQ3YzA1ZDdmODcyZDlmZTg4ZGUwZGJhOWQ3ZGNiNWI5ZmE4OWFmOTVkZDgyYjEzOTUyZjlhYg', 'https://www.checkers.co.za/medias/10126789EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wxMDYzMDh8aW1hZ2UvcG5nfGltYWdlcy9oNzcvaDAzLzg4NjA4NzYzMDg1MTAucG5nfDVhMTE1MmE5YmMxNjE0OGZmM2IwOTcxMWQzYWIyY2IxOTU2MmY1M2M2N2MzZjc5ZDE2YWFmNGFiZjdiOTI1YzY', 'https://www.checkers.co.za/medias/10241933EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2NDQ4NXxpbWFnZS9wbmd8aW1hZ2VzL2gzNy9oMTkvODg1ODQwMjg4MTU2Ni5wbmd8NDA5MDRlMDZkM2U3M2JiODUwNWVmYmE5YzM3NjQ3NDkxYTMzZmI0ZjY3OWFkNDZiODU2YjQ2NTRjOTQyNjI2MQ', 'https://www.checkers.co.za/medias/10147193EAV2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2MDk2M3xpbWFnZS9wbmd8aW1hZ2VzL2g4Zi9oNTUvODk1NjcwODIyNTA1NC5wbmd8YmFkMTgzZDFkNGRiY2ViMzU4MjNhMGY0ZmM1OTgyY2U4NTY5MGZiZWI5ZTMzNjE1ZTNkY2Q1YzAwY2JhZTgyYw', 
'https://www.checkers.co.za/medias/10164636EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w0MDM4NnxpbWFnZS9wbmd8aW1hZ2VzL2gxMS9oYWIvODg1ODA5NTU4MzI2Mi5wbmd8YTYxY2ZkNjAzOTg2NGVmNGMxODVjNmRkNTAyYmYzOWM2ZDU0MzgyYzk3YjM2YWUzOTRkMGEwOWE0NmVjMDQ0NA', 'https://www.checkers.co.za/medias/10136574EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w1MzU0NXxpbWFnZS9wbmd8aW1hZ2VzL2hjMi9oNjEvODg1NzQzMjk4MTUzNC5wbmd8YTJmZGU0ZGVjNDU1NTIyMzU0NGM1ZTQzOTQ4OTUwNmEwN2I3MDc5MjliOWNkNmRlNTgzZDMxYjdkMmNjNTIyYw', 'https://www.checkers.co.za/medias/10604301EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w1OTUxN3xpbWFnZS9wbmd8aW1hZ2VzL2hiZi9oZDQvODg1OTg2MzY0NjIzOC5wbmd8MTE5ZTQ4NzJkMzhjMzczMDk0MTE4YTZhZTllMTFlYjBiMTUyY2IzNjIyMmM4NzFlODA0MTU1Yzg3ZWNkMjMyNQ', 'https://www.checkers.co.za/medias/10145422EA-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w3OTc0MHxpbWFnZS9wbmd8aW1hZ2VzL2gxYy9oNTIvODk2MjM0ODIyMDQ0Ni5wbmd8MDc5NjY1YzY2NGE0NDFiNWRiN2NkNWZkMWJlODg5MDhlOWUyZWNhNDEzMWJiZTQ3MjM5MDYyZjgzZWYyYWM2Mg', 'https://www.checkers.co.za/medias/10136291EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w5NjM3NnxpbWFnZS9wbmd8aW1hZ2VzL2hjNi9oMzUvODg1NzQzMDU1NjcwMi5wbmd8ZDQwZjEzMmU5Y2JkMDhkNGM2MGQ4ZTc1MWY0Y2Q5YTJhZWI2YmM2YmY5YjNiYWEyZjQ0YWQ5ZDgyMmE3ZWE2YQ', 'https://www.checkers.co.za/medias/10148833EAV2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w3NTU3NnxpbWFnZS9wbmd8aW1hZ2VzL2g1OC9oNzIvODk1NjY5NjQ2MTM0Mi5wbmd8N2YyZmMxMjA5ZjkxZjkzZWExN2E2MGE1ZTZiZjI0M2FkMDcxZTVlMzY0ZjAzOTRjMjAzNzRjYWQ5Yzk4NjZkNQ'] 
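One caveat with the list comprehension above: attrs.get() returns None when an image has no "data-original-src", and concatenating None to a string raises a TypeError. The following is a defensive sketch on invented inline HTML, not a tested replacement for the live page; the function name image_urls, the sample markup, and the choice to fall back to "src" are all assumptions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.checkers.co.za"

# Invented markup: one tile with both attributes, one with only `src`,
# and one with no <img> at all.
SAMPLE_HTML = """
<div class="item-product__image"><img src="/medias/lqi-a" data-original-src="/medias/a"></div>
<div class="item-product__image"><img src="/medias/b"></div>
<div class="item-product__image"></div>
"""


def image_urls(markup, base_url=BASE_URL):
    """Collect absolute image URLs, preferring data-original-src over src."""
    soup = BeautifulSoup(markup, "html.parser")
    urls = []
    for prod in soup.find_all("div", {"class": "item-product__image"}):
        img = prod.find("img")
        if img is None:
            continue  # tile without an image: skip instead of crashing
        # Fall back to the placeholder only when the full-size path is missing.
        path = img.get("data-original-src") or img.get("src")
        if path:
            urls.append(urljoin(base_url, path))
    return urls


print(image_urls(SAMPLE_HTML))
```

urljoin is used instead of string concatenation so that a path that is already absolute would be left alone.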
AmineBTG
  • So I noticed something when comparing your version and mine. The output differs depending on whether it's parsed with `bs4` or `lxml.html`. Adding the headers and accessing `data-original-src` with `lxml.html` seems to have no effect and gives the wrong output. Thank you for this. I want to keep the script as light as possible, so I'll dig deeper into why the output is different – Marco Fernandes Feb 03 '21 at 11:15
  • I think I was accessing `data-original-src` incorrectly with `lxml.html`. I changed it from `//a[@class="product-listening-click"]/img[@src]` to `//a[@class="product-listening-click"]/img/@data-original-src` and now I am getting the correct image too. I'll give you the answer tick since you helped me with the `data-original-src` thought process. – Marco Fernandes Feb 03 '21 at 11:28
  • Hi, yes, indeed the headers have no effect; the issue was with 'data-original-src'. It is just good practice to send headers (at least the user-agent) to avoid being blocked when web scraping. – AmineBTG Feb 03 '21 at 14:41
  • Thank you again for the help. Hope to use this data with a custom Alexa grocery skill :) – Marco Fernandes Feb 03 '21 at 15:09

When a web browser sends an HTTP request, it includes a lot of information about itself in the headers, which lets the website serve the version of the page best suited to that particular browser. When you make a request via the requests module, only a minimal set of default headers is sent (including a generic python-requests User-Agent), so the website gets almost none of this information and may send a version of the page that is slightly different from what you would get in a browser.

This is why you are getting two different image sources depending on how you request the website. The browser is getting a higher-quality image because the website has enough info about the way the image is going to be used to send the best possible version of the image, while the script request gets a lower-quality image because the website sends a much smaller version of the image to reduce traffic.
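To see what a script would actually send, a request can be prepared without sending it. The sketch below uses requests' Session.prepare_request to compare the default User-Agent with a browser-like one; the User-Agent string is copied from the other answer, and nothing here touches the network. Note that, per the question's edits, changing the User-Agent did not change the src on this particular site.

```python
import requests

URL = "https://www.checkers.co.za/c-2256/All-Departments"
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
    ),
}

session = requests.Session()

# What requests sends when no headers are supplied: a python-requests/<version> agent.
default_agent = session.headers["User-Agent"]

# Build the request with custom headers and query parameters, but do not send it.
prepared = session.prepare_request(
    requests.Request("GET", URL, headers=BROWSER_HEADERS, params={"page": 0})
)

print(default_agent)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Passing the prepared request to session.send(prepared) would perform the actual fetch.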

  • Is there a way to fake the window size with requests, maybe? – Marco Fernandes Feb 03 '21 at 08:51
  • Take a look at [Python's request headers](https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers) and [User-Agent headers](https://stackoverflow.com/questions/27652543/how-to-use-python-requests-to-fake-a-browser-visit-a-k-a-and-generate-user-agent) – Awesomepotato29 Feb 03 '21 at 08:58
  • I tried adding a user agent with fake-useragent as recommended in the links you shared, but I'm still getting the same output. I looped through all the user agents the package provides – Marco Fernandes Feb 03 '21 at 09:06