I'm trying to scrape Shopee product names, prices, and images, but I can't extract the images. Is it because of the HTML? I can't find the right class for the image element that I assign to dataImg.

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

driver =webdriver.Chrome('chromedriver')

products=[]
prices=[]
images=[]

driver.get('https://shopee.co.id/search?keyword=laptop')

content=driver.page_source
soup=BeautifulSoup(content)
soup

for link in soup.find_all('div',class_="_3EfFTx"):
    print('test')
    print(link)

for link in soup.find_all('div',class_="_3EfFTx"):
    #print(link)
    dataImg=link.find('img',class_="_1T9dHf V1Fpl5")
    print(dataImg)
    name=link.find('div',class_="_1Sxpvs")
    #print(name.get_text())
    price=link.find('div',class_="QmqjGn")
    #print(price.get_text())
    
    if dataImg is not None:
        products.append(name.get_text())
        prices.append(price.get_text())
        images.append(dataImg['src'])

df=pd.DataFrame({'Product Name':products,'Price':prices,'Images':images})
df
Clairine

2 Answers

The website uses JavaScript to load the images. To work around this, use Selenium with a short delay so the page has time to render. The following code prints the src of each image:

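from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

products=[]
prices=[]
images=[]

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(r'F:\Sonstiges\chromedriver\chromedriver.exe'))
driver.get('https://shopee.co.id/search?keyword=laptop')

sleep(8)  # give the page's JavaScript time to load the images
# find_elements_by_class_name was removed in Selenium 4; use find_elements with By
imgs = driver.find_elements(By.CLASS_NAME, '_1T9dHf')
for img in imgs:
    img_url = img.get_attribute("src")
    if img_url:
        print(img_url)
driver.quit()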

To get the images themselves, download them from the fetched URLs. If you are using Beautiful Soup only because it runs in the background, note that Selenium can also run headless (in the background).
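
Downloading from the fetched URLs could look roughly like the sketch below. It uses only the standard library. The helper names `filename_for` and `download_images`, the `images` target directory, and the `.jpg` fallback extension are assumptions made for illustration; the URLs printed by the code above may need a different extension.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_for(url, default_ext=".jpg"):
    """Derive a local file name from the last path segment of the URL."""
    name = os.path.basename(urlparse(url).path) or "image"
    _, ext = os.path.splitext(name)
    # Assumption: URLs without an extension point to JPEG files.
    return name if ext else name + default_ext


def download_images(urls, dest_dir="images"):
    """Download every URL into dest_dir and return the saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(dest_dir, filename_for(url))
        urlretrieve(url, path)  # performs the HTTP request
        saved.append(path)
    return saved


# Usage (hits the network, so it is left commented out):
# download_images(["https://cf.shopee.co.id/file/..."])
```

Collect the `img_url` values in a list instead of printing them, then pass that list to `download_images`.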

Frederick

What happens?

You grab the page source before all of the content has loaded. Simply waiting longer won't help, because only the first images are loaded up front; the remaining images load only once they scroll into view.
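
One way to check this is to count how many img tags in the page source actually carry a src. The sketch below uses the standard library's html.parser instead of Beautiful Soup so that it runs on its own. `ImgCounter` and `count_images` are hypothetical names, and it assumes that images which have not loaded yet appear as img tags without a src (they may also be missing from the markup entirely).

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> tags with and without a non-empty src attribute."""

    def __init__(self):
        super().__init__()
        self.with_src = 0
        self.without_src = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        if dict(attrs).get("src"):
            self.with_src += 1
        else:
            self.without_src += 1


def count_images(html):
    """Return (images_with_src, images_without_src) for an HTML string."""
    counter = ImgCounter()
    counter.feed(html)
    return counter.with_src, counter.without_src


# Usage with Selenium (commented out, needs a running browser):
# print(count_images(driver.page_source))
```

Calling `count_images(driver.page_source)` once right after the page opens and once after scrolling should show whether the first number grows.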

How to fix that?

You have to wait a bit and then scroll down to the bottom of the page in steps:

time.sleep(5)
for i in range(10):
    driver.execute_script("window.scrollBy(0, 350)")
    time.sleep(1) 
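
The fixed count of 10 steps only covers a page of a certain length. A variant that keeps scrolling until the bottom is reached might look like the sketch below. `scroll_to_bottom` and its parameters are hypothetical names, and the 2 pixel tolerance is an assumption to absorb rounding in the browser's scroll position. `max_steps` caps the loop in case the page keeps growing.

```python
import time


def scroll_to_bottom(driver, step=350, pause=1.0, max_steps=100):
    """Scroll down in fixed steps until the bottom of the page is reached.

    Returns the number of scroll steps performed (at most max_steps).
    """
    for n in range(max_steps):
        # Scroll by `step` pixels, then give lazy-loaded images time to appear.
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        time.sleep(pause)
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset"
            " >= document.body.scrollHeight - 2;"
        )
        if at_bottom:
            return n + 1
    return max_steps


# Usage (needs a live Selenium driver):
# scroll_to_bottom(driver)
# content = driver.page_source
```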

Example

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium 4.6+ locates chromedriver itself; older versions need it on PATH
driver = webdriver.Chrome()

products=[]
prices=[]
images=[]

driver.get('https://shopee.co.id/search?keyword=laptop')

# wait for the initial load, then scroll in steps so lazy-loaded images render
time.sleep(5)
for i in range(10):
    driver.execute_script("window.scrollBy(0, 350)")
    time.sleep(1)

content=driver.page_source
driver.quit()
soup=BeautifulSoup(content, 'html.parser')  # name the parser explicitly

for item in soup.select('div[data-sqe="item"]'):
    dataImg=item.img
    name=item.find('div',class_="_1Sxpvs")
    price=item.find('div',class_="QmqjGn")

    # skip items whose image, name, or price has not rendered yet
    if dataImg is not None and name is not None and price is not None and dataImg.get('src'):
        products.append(name.get_text())
        prices.append(price.get_text())
        images.append(dataImg['src'])

df=pd.DataFrame({'Product Name':products,'Price':prices,'Images':images})
df

Output

   Product Name                                        Price                      Images
0  [ACQ] Meja Laptop Lipat Portable                    Rp51.990                   https://cf.shopee.co.id/file/83a9e6e8ecad7a3db...
1  LENOVO Thinkpad CORE i5 Ram 8GB/ 2TB/1TB/500GB...   Rp2.100.000 - Rp4.200.000  https://cf.shopee.co.id/file/44fbc24f5c585cda1...
2  HP Laptop 14s-cf3076TU/i3-1005G1/256GB SSD/14"...   Rp6.599.000Rp6.598.999     https://cf.shopee.co.id/file/170a45679aa5002f1...
...
HedgeHog