
Please pardon my ignorance, but I can't get my head around this. I had to create a new question because I've realized I don't really know how to do this. How do I scrape the images from a webpage like this: https://www.jooraccess.com/r/products?token=feba69103f6c9789270a1412954cf250 ? I have experience with BeautifulSoup, but as far as I understand, I need to use some other package here? `soup.find("div", class_="PhotoBreadcrumb_...6uHZm")` doesn't work.

<div class="PhotoBreadcrumb_PhotoBreadcrumb__14D_N ProductCard_photoBreadCrumb__6uHZm">
   <img src="https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Iris_floral02.jpg" alt="Breadcrumb">

<div class="PhotoBreadcrumb_breadcrumbContainer__2cALf" data-testid="breadcrumbContainer">
    <div data-position="0" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="1" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="2" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="3" class="PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="4" class="PhotoBreadcrumb_active__2T6z2 PhotoBreadcrumb_dot__2PbsQ"></div>
    <div data-position="5" class="PhotoBreadcrumb_dot__2PbsQ"></div>
</div>
</div>
  • There is already a question from you on the subject, which has already received answers that point to the problem or show an alternative. [How to scrape images from slider/slideshow?](https://stackoverflow.com/questions/71299993/how-to-scrape-images-from-slider-slideshow) – HedgeHog Mar 01 '22 at 08:24
  • If it is not a duplicate, **what is the difference**? Suggestion - Take a minute or two to read [ask] and improve your question with a [mcve] . Thanks – HedgeHog Mar 01 '22 at 08:38

1 Answer


BeautifulSoup is for parsing the HTML you get back after sending an HTTP request. In your case the product data comes back as JSON, so you should:

1. Send an HTTP request to your target website with the requests module (with appropriate headers).

2. Parse the JSON data from the response.

3. Iterate over the list of products.

4. For each product, get the imageUrls list.

5. Send a new request for each image URL in that list.

6. Save the image.
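The traversal in steps 2–4 can be sketched against a minimal illustrative payload. The field names below mirror the real response, but the sample values are made up:

```python
# Illustrative payload shaped like the GraphQL response (values are made up).
sample_response = {
    "data": {
        "public": {
            "collectionProductsByShareToken": {
                "edges": [
                    {"product": {"imageUrls": [
                        "https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Iris_floral02.jpg",
                    ]}},
                ]
            }
        }
    }
}

def collect_image_urls(payload):
    """Flatten every product's imageUrls list into one flat list of URLs."""
    edges = payload["data"]["public"]["collectionProductsByShareToken"]["edges"]
    return [url for edge in edges for url in edge["product"]["imageUrls"]]

print(collect_image_urls(sample_response))
```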


Code:

Note: you should update the cookie in the headers to get a response.

import requests
from os.path import basename
from urllib.parse import urlparse

URL = 'https://atlas-main.kube.jooraccess.com/graphql'
headers = {"accept": "*/*","Accept-Encoding": "gzip, deflate, br","Accept-Language": "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7","Connection": "keep-alive","Content-Length": "2249","content-type": "application/json","Cookie":"_hjSessionUser_686103=eyJpZCI6ImY4MTZjN2YxLWJlYmQtNTg2ZC1iYmRkLTllYjdhNGQzNmVjYiIsImNyZWF0ZWQiOjE2NDYxMTkwMDUyODcsImV4aXN0aW5nIjp0cnVlfQ==; _hjSession_686103=eyJpZCI6ImM5MWJmOGRhLTcwZDEtNGQ2ZS04MzA1LTQ4NWNlYTYzZGMwNSIsImNyZWF0ZWQiOjE2NDYxMjc3MDQ5MjgsImluU2FtcGxlIjp0cnVlfQ==; _hjAbsoluteSessionInProgress=0; mp_2e072c90929b30e1ea5d9fd56399f106_mixpanel=%7B%22distinct_id%22%3A%20%2217f4456c057375-062236d0c47071-a3e3164-144000-17f4456c05857f%22%2C%22%24device_id%22%3A%20%2217f4456c057375-062236d0c47071-a3e3164-144000-17f4456c05857f%22%2C%22%24initial_referrer%22%3A%20%22%24direct%22%2C%22%24initial_referring_domain%22%3A%20%22%24direct%22%2C%22accountId%22%3A%20null%2C%22canShop%22%3A%20false%2C%22canTransact%22%3A%20false%2C%22canViewAssortments%22%3A%20false%2C%22canViewDataPortal%22%3A%20false%2C%22userId%22%3A%20null%2C%22accountUserId%22%3A%20null%2C%22isAdmin%22%3A%20false%2C%22loggedAsAdmin%22%3A%20false%2C%22retailerSettings%22%3A%20false%2C%22assortmentPlanning%22%3A%20false%2C%22accountType%22%3A%201%7D","Host": "atlas-main.kube.jooraccess.com","Origin": "https://www.jooraccess.com","Referer": "https://www.jooraccess.com/","sec-ch-ua-mobile": "?0","sec-ch-ua-platform": "Windows","Sec-Fetch-Dest": "empty","Sec-Fetch-Mode": "cors","Sec-Fetch-Site": "same-site","User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",}
result = requests.get(URL, headers=headers).json()  # the headers (especially the Cookie) are needed to get a response

data = result["data"]["public"]["collectionProductsByShareToken"]["edges"]

for d in data:
  img_urls = d["product"]["imageUrls"]
  for img_url in img_urls:
    # Stream each image and write it to disk block by block
    response = requests.get(img_url, stream=True)
    if not response.ok:
      print(response)
      continue
    img_name = basename(urlparse(img_url).path)
    with open(img_name, 'wb') as handle:
      for block in response.iter_content(1024):
        if not block:
          break
        handle.write(block)
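The filename logic in the save step can be checked in isolation; this helper is just an extraction of what the loop above already does with `basename` and `urlparse`:

```python
from os.path import basename
from urllib.parse import urlparse

def filename_from_url(url):
    """Derive a local filename from the path component of an image URL,
    ignoring any query string or fragment."""
    return basename(urlparse(url).path)

print(filename_from_url(
    "https://cdn.jooraccess.com/img/uploads/accounts/678917/images/Iris_floral02.jpg"
))  # Iris_floral02.jpg
```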
    
  • Source for saving the image: https://stackoverflow.com/questions/30229231/python-save-image-from-url –  Mar 01 '22 at 07:57
  • thanks for the response but it doesn't work, when I view the "page" after executing page = BeautifulSoup(result.content, features='html.parser') I don't even see the same html code that I see when I click 'Inspect' in Chrome, this one is much shorter and doesn't even contain 'img' or 'src' – hkm Mar 01 '22 at 08:15
  • After reviewing your case, I found that the site sends the data in JSON format (GraphQL behind the scenes), but there are multiple products and each has a list of image URLs. Do you want to get all of the images, or what? –  Mar 01 '22 at 08:49
  • Yes, thank you, sudoer Ali! Yes, ideally I'd like all of the images – hkm Mar 01 '22 at 09:22
  • Check the new version of code now, you should update the cookie in the headers. –  Mar 01 '22 at 10:58
  • Thank you very much for trying to help me but for some reason this still doesn't work for me, either I'm copy-pasting wrong cookies or not formatting it correctly – hkm Mar 01 '22 at 12:14