I've been using Selenium to scrape images and product data from the CB2 website, specifically this URL: https://www.cb2.com/furniture/sofas/1/filters/sofas~2A, but I only get the images/product data for the first 8 items; after that, I only get a placeholder image. I initially thought it was because the images are lazy-loaded, but after adapting my code I still get the same result. I don't know if I'm doing something wrong or if my approach is simply not right.
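One thing I tried while adapting for lazy loading was polling each image's src until it stops being the placeholder (the "loader" marker is just a guess based on the file names I get back, see the output below):

```python
import time

PLACEHOLDER_MARKER = "loader"  # guess: the placeholder file names I see contain "loader"

def is_placeholder(src):
    """Heuristic: treat a missing URL or a loader image as not-yet-loaded."""
    return src is None or PLACEHOLDER_MARKER in src

def wait_for_real_src(get_src, timeout=10.0, poll=0.25):
    """Poll get_src() until it returns a non-placeholder URL or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        src = get_src()
        if not is_placeholder(src):
            return src
        time.sleep(poll)
    return None
```

In the Selenium loop, `get_src` would be `lambda: image.get_attribute(image_attr)`, called after scrolling the element into view with `driver.execute_script("arguments[0].scrollIntoView();", image)` — but the src never changes for items past the first 8.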
I found this other approach: https://stackoverflow.com/a/62722063/20433945, which I think might work for my case, but I can't find the POST form data needed to imitate the POST request to this API: https://ingest.quantummetric.com/crateandbarrel. I looked for it among the XHR (XMLHttpRequest) requests in the browser's network traffic logger, but haven't been able to find it. Edit: This is what I see in the POST request body:
x}Ko@FÿËÝqJ}">*ÔÚ¦Q_ÅÆÿ^RMkêþóåæ¡æë"0KB®SÇ
;0A"5I$013D`ëì¤~[ÆFÏZø·Q´tFÑP_ÊkÒ=ÜaqÑû©z£;¨A®Ñ õ»1?¸¨9gµÒz×ÃNx[×È/ý6H/ ¦³ó©Qù¹ÑCYî»Sw¹{TfdÓíUþhÛvPÇdjÓÊ?Yâ`=+Ò}²Vt lEËjÌÇð}¦6¢+M©ÃSÛk|ô§K»SǽW{F:~Só<oÃhÔx3«>D»µÍ±âbÇÏÑÊeq=A½Li8½>9³z
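That body looks like binary rather than form data, so I suspect it is compressed. I tried a guess-and-check helper over the common encodings (I don't actually know which encoding this endpoint uses):

```python
import gzip
import zlib

def try_decompress(raw: bytes):
    """Try common compressed encodings on a captured request body.
    Returns (method_name, decoded_text) on success, or (None, None) if nothing worked.
    Guess-and-check only: the real encoding of the captured body is unknown."""
    attempts = [
        ("gzip", gzip.decompress),
        ("zlib", zlib.decompress),
        # raw deflate stream with no zlib header
        ("raw-deflate", lambda b: zlib.decompress(b, -zlib.MAX_WBITS)),
    ]
    for name, fn in attempts:
        try:
            return name, fn(raw).decode("utf-8", errors="replace")
        except Exception:
            continue  # wrong format, try the next encoding
    return None, None
```

The raw bytes would come from the browser's "copy as cURL"/saved request, not from the text shown above (which is already mangled by the clipboard).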
This is the Python code:
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def data_getter(self):
    # url, group_selector, image_selector, image_attr, title_selector,
    # product_page_selector and price_selector are defined elsewhere in my class
    self.driver.get(url)
    images_list = []
    titles_list = []
    pages_list = []
    price_list = []
    scroll_pause_time = 0.5
    i = 0
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    while True:
        # scroll to the bottom, wait, and stop when the height stops growing
        # (or after 5 passes)
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_pause_time)
        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
        i += 1
        if i == 5:
            break
    wait = WebDriverWait(self.driver, 10)
    wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "ul.card-deck-container")))
    self.driver.implicitly_wait(10)
    for item in self.driver.find_elements(By.CSS_SELECTOR, group_selector):
        title_txt = ""
        page_url = ""
        price_txt = ""
        try:
            image = item.find_element(By.CSS_SELECTOR, image_selector)
            image_url = image.get_attribute(image_attr)
            if title_selector is not None:
                title = item.find_element(By.CSS_SELECTOR, title_selector)
                title_txt = title.get_attribute('alt')
            if product_page_selector is not None:
                page = item.find_element(By.CSS_SELECTOR, product_page_selector)
                page_url = page.get_attribute('href')
            if price_selector is not None:
                price = item.find_element(By.CSS_SELECTOR, price_selector)
                price_txt = price.text
                price_list.append(price_txt)
            else:
                price_list.append(None)
            images_list.append(image_url)
            titles_list.append(title_txt)
            pages_list.append(page_url)
        except Exception as e:
            print(e)
    print(len(images_list))
    print(len(titles_list))
    print(len(pages_list))
    print(len(price_list))
    print(images_list)
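I also wondered whether jumping straight to `document.body.scrollHeight` skips past the items faster than the lazy loader can fire, so I considered scrolling down in viewport-sized steps instead (the step size and pause are guesses):

```python
import time

def scroll_steps(total_height, step):
    """Return the sequence of scroll offsets for stepping down a page of total_height
    pixels in `step`-pixel increments, always finishing at the very bottom."""
    if step <= 0:
        raise ValueError("step must be positive")
    offsets = list(range(0, total_height, step))
    if offsets and offsets[-1] != total_height:
        offsets.append(total_height)  # make sure the last stop is the bottom
    return offsets

def scroll_page_incrementally(driver, step=800, pause=0.5):
    """Scroll down in increments, pausing at each stop so lazy images can load.
    Assumes a Selenium WebDriver; step and pause values are untested guesses."""
    height = driver.execute_script("return document.body.scrollHeight")
    for y in scroll_steps(height, step):
        driver.execute_script("window.scrollTo(0, arguments[0]);", y)
        time.sleep(pause)
```

Would replacing the scroll-to-bottom loop above with something like this help, or is the problem elsewhere entirely?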
This is what it prints (it actually prints the full URLs, but I removed them so my post isn't flagged as spam):
89
89
89
89
['curvo-snow-sofa.jpg',
'gwyneth-boucle-loveseat.jpg',
'camden-white-sofa.jpg',
'outline-sofa.jpg',
'muir-grey-woven-curved-sofa.jpg',
'muir-camel-velvet-curved-sofa.jpg',
'lenyx-saddle-leather-sofa.jpg',
'bacio-sofa.jpg',
'white_15x15_loader',
.
.
.
'white_15x15_loader',
'white_15x15_loader']