
I have a problem scraping an e-commerce site using BeautifulSoup. I did some Googling, but I still can't solve the problem.

Please refer to the screenshots: (1) the Chrome F12 Inspect Element view and (2) my Python output.

Here is the site that I tried to scrape: "https://shopee.com.my/search?keyword=h370m"

Problem:

  1. When I open Inspect Element in Google Chrome (F12), I can see the HTML tags for the product's name, price, etc. But when I run my Python program, I don't get the same code and tags in the result. After some googling, I found out that this website uses an AJAX query to load the data.

  2. Can anyone help me with the best method to get these products' data by scraping an AJAX site? I would like to display the data in table form.

My code:

import requests
from bs4 import BeautifulSoup

# Fetch the search page and print the parsed HTML
source = requests.get('https://shopee.com.my/search?keyword=h370m')
soup = BeautifulSoup(source.text, 'html.parser')
print(soup)
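
For example, a quick check like this (just a sketch, continuing from the snippet above) shows there is hardly any visible text in the response, because the product listings are filled in later by JavaScript:

# Continuing from the snippet above: the visible text is nearly empty
# because the products are injected later by JavaScript (AJAX),
# not present in this response.
visible_text = soup.get_text(' ', strip=True)
print(len(visible_text))
print(visible_text[:200])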

2 Answers


Welcome to Stack Overflow! You can inspect where the AJAX request is being sent and replicate it.

In this case the request goes to the API URL used in the code below. You can then use requests to perform a similar request. Notice, however, that this API endpoint requires a correct User-Agent header. You can use a package like fake-useragent or just hardcode a string for the agent.

import requests

# Option 1: generate a browser user agent with the fake-useragent package
from fake_useragent import UserAgent
user_agent = UserAgent().chrome

# Option 2: or just hardcode one (this overrides option 1)
user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36'

url = 'https://shopee.com.my/api/v2/search_items/?by=relevancy&keyword=h370m&limit=50&newest=0&order=desc&page_type=search'
resp = requests.get(url, headers={
    'User-Agent': user_agent
})
data = resp.json()

# The product listings are in the 'items' key of the JSON response
products = data.get('items')
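
Since you want a table, you can then loop over the items. A minimal sketch, continuing from the snippet above; the field names 'name' and 'price' and the price scale factor are assumptions based on the usual shape of this response, so verify them against the raw JSON you see in the Network tab:

# Sketch of a simple text table; field names and scaling are assumptions.
for item in products or []:
    name = item.get('name', '')
    # Shopee appears to return prices scaled by 100000 - check the raw JSON
    price = item.get('price', 0) / 100000
    print(f'{name[:60]:<60} RM {price:>10.2f}')
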
dmitrybelyakov
  • Thank you, sir. I'm sorry, but how do I get the API URL that the request is sent to? To be specific, where can I find the API call on that particular page? Is it by using 'Inspect Element' in Chrome? – Firdhaus Saleh Jan 30 '19 at 03:41
  • @FirdhausSaleh In Chrome you can open the developer tools, then go to the Network tab and select XHR to show only XHR requests. You will also be able to click requests and inspect the responses right there in the Network tab. – dmitrybelyakov Jan 30 '19 at 14:54
  • Before I forget, thank you, sir. I solved it. And yes, it was the header issue. Again, thank you very much. – Firdhaus Saleh Jun 30 '19 at 10:54

Welcome to Stack Overflow! :)

As an alternative, you can check out Selenium.

See this example usage from the documentation:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

When you use requests (or libraries like Scrapy), JavaScript is usually not executed, so the data it loads never appears in the response. As @dmitrybelyakov mentioned, you can replay these AJAX calls yourself, or imitate normal user interaction using Selenium.
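
If you go the Selenium route for this particular page, something along these lines could work. This is only a sketch: the '.shopee-search-item-result__item' selector is a guess at the product-card class, so inspect the live page in DevTools and adjust it.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('https://shopee.com.my/search?keyword=h370m')

# Wait until the JavaScript has rendered the product cards.
# '.shopee-search-item-result__item' is a guess at the card selector -
# check the real class names in DevTools and adjust if needed.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, '.shopee-search-item-result__item')))

# Hand the rendered HTML to BeautifulSoup as usual.
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for card in soup.select('.shopee-search-item-result__item'):
    print(card.get_text(' ', strip=True)[:80])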

KenanBek