What route should I go to web-scrape these images from the CB2 website: Selenium or POST requests approach?

Question

I've been using selenium to try to web-scraping images and product data from the CB2 website, specifically this URL: https://www.cb2.com/furniture/sofas/1/filters/sofas~2A, but I keep getting only the images/product data for the first 8 items. After that, I only get the placeholder image. I initially thought it was because the images are lazy-loaded but after adapting my code, I'm still getting the same result. I don't really know if I'm doing something wrong or if my approach is not correct.

I found this other approach: https://stackoverflow.com/a/62722063/20433945, which I think might work for my case, but I can't find the POST form data to imitate the POST request to this api: https://ingest.quantummetric.com/crateandbarrel. I looked for it in the browser's network traffic logger XHR (XmlHttpRequest) requests, but haven't been able to find it. Edit: This is what I see in the POST request:

x}Ko@FÿËÝqJ}">*ÔÚ¦Q_ÅÆÿ^RMkêþóåæ¡æë"0KB®SÇ
;0A"5I$013D`ëì¤~[ÆFÏZø·Q´tFÑP_ÊkÒ=ÜaqÑû©z£;¨A®Ñ õ»1?¸¨9gµÒz×ÃNx[×È/ý6H/ ¦³ó©Qù¹ÑCYî»Sw¹{TfdÓíUþhÛvPÇdjÓÊ?Yâ`=+Ò}²Vt lEËjÌÇð}¦6¢+M©ÃSÛk|ô§K»SÇ½W{F:~Só<oÃhÔx3«>D»µÍ±âbÇÏÑÊeq=A½Li8½>9³z

This is the python code:

def data_getter():
  driver.get(url)
        
  images_list = []
  titles_list = []
  pages_list = []
  price_list = []
        
  scroll_pause_time = 0.5
  i = 0
  last_height = self.driver.execute_script("return document.body.scrollHeight")
  while True:
    self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)
    new_height = self.driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
      break
    last_height = new_height
    i += 1
    if i == 5:
      break
    wait = WebDriverWait(self.driver, 10)
    wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "ul.card-deck-container")))
    self.driver.implicitly_wait(10)
        
    for item in self.driver.find_elements(By.CSS_SELECTOR, group_selector):
      title_txt = ""
      page_url = ""
      price_txt = ""
      try:
        image = item.find_element(By.CSS_SELECTOR, image_selector)
        image_url = image.get_attribute(image_attr)

        if title_selector is not None:
          title = item.find_element(By.CSS_SELECTOR, title_selector)
          title_txt = title.get_attribute('alt')
                   
        if product_page_selector is not None:
          page = item.find_element(By.CSS_SELECTOR, product_page_selector)
          page_url = page.get_attribute('href')

        if price_selector is not None:
          price = item.find_element(By.CSS_SELECTOR, price_selector)
          price_txt = price.text
          price_list.append(price_txt)
        else:
          price_txt = None
          price_list.append(price_txt)

        images_list.append(image_url)
        titles_list.append(title_txt)
        pages_list.append(page_url)

      except Exception as e:
        print(e)
        print(len(images_list))
        print(len(titles_list))
        print(len(pages_list))
        print(len(price_list))
        print(images_list)

This is what it prints (it actually prints the URLs but I removed them so my post isn't flagged as spam):

89
89
89
89

['curvo-snow-sofa.jpg', 
'gwyneth-boucle-loveseat.jpg', 
'camden-white-sofa.jpg', 
'outline-sofa.jpg', 
'muir-grey-woven-curved-sofa.jpg', 
'muir-camel-velvet-curved-sofa.jpg', 
'lenyx-saddle-leather-sofa.jpg', 
'bacio-sofa.jpg',
'white_15x15_loader',
.
.
.
'white_15x15_loader',
'white_15x15_loader']

that "white_15x15_loader" text sure does seem like a placeholder used for lazy loading images. You don't mention what you tried to resolve that... (sleep after scroll?) The site may also be smart enough to know whether those placeholders are actually visible before loading content into them and could use a sort of rolling cache for content loaded. For the 2nd question here, how do you know the URL to the API if you can't find the request to it? — pcalkins, Mar 24 '23 at 21:01
@pcalkins I tried sleep after scroll but same result. For the second question, I actually see the request, I'll update the explanation with what I see. I'm also not sure if that's the URL to the API. — br1saturno, Mar 24 '23 at 21:42
If the duration of the sleep has no effect on the result, you probably have a rolling cache of 8 items. (and actually you'd be getting stale element exceptions if the list were still loading...) Not sure why you're getting all those "loader" elements though.... might just be lousy coding.... or they adjust the size of the shown elements according to the size of the browser or the location of elements in the viewport. (visibility api?) At any rate, you'd need to get 8 at a time. (so grab some before scrolling... scroll 8 items down, grab the next 8, etc...) — pcalkins, Mar 24 '23 at 22:27

What route should I go to web-scrape these images from the CB2 website: Selenium or POST requests approach?

0 Answers0