Why can't I scrape Amazon products by BeautifulSoup?

Question

I am trying to scrape the heading of this Amazon listing. The code I wrote works for some other Amazon listings, but not for the URL in the code below.

Here is the Python code I've tried:

import requests
from bs4 import BeautifulSoup
url="https://www.amazon.in/BULLMER-Cotton-Printed-T-shirt-Multicolour/dp/B0892SZX7F/ref=sr_1_4?c=ts&dchild=1&keywords=Men%27s+T-Shirts&pf_rd_i=1968024031&pf_rd_m=A1VBAL9TL5WCBF&pf_rd_p=8b97601b-3643-402d-866f-95cc6c9f08d4&pf_rd_r=EPY70Y57HP1220DK033Y&pf_rd_s=merchandised-search-6&qid=1596817115&refinements=p_72%3A1318477031&s=apparel&sr=1-4&ts_id=1968123031"
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"}

page = requests.get(url, headers=headers)
print(page.status_code)
soup = BeautifulSoup(page.content, "html.parser")
#print(soup.prettify()) 
title = soup.find(id = "productTitle")
if title:
    title = title.get_text()
else:
    title = "default_title"
print(title)

Output:

200
default_title

HTML from the browser's inspector tools:

<span id="productTitle" class="a-size-large product-title-word-break">
BULLMER Mens Halfsleeve Round Neck Printed Cotton Tshirt - Combo Tshirt - Pack of 3
</span>

Make sure that the HTML is actually returned in the response and not populated by JavaScript when you view the page in the browser. — jordanm, Aug 07 '20 at 16:37
Only the author gets notified of comments. If you want to call someone else's attention, you can do it with a mention such as "@dimay I have a follow-up question". — RichieV, Aug 07 '20 at 17:14
https://stackoverflow.com/questions/8287628/proxies-with-python-requests-module — dimay, Aug 07 '20 at 17:22

score 3 · Accepted Answer · answered Aug 07 '20 at 18:05

First, as others have commented, use a proxy service. Second, the ASIN alone is enough to reach an Amazon product page; you don't need the rest of the search URL.

Amazon follows this URL pattern for all product pages:

https://www.amazon.(com/in/fr)/dp/<asin>
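If you only have the long search-result link, a small helper can reduce it to the short form. This is a minimal sketch, not anything Amazon documents: the function name `canonical_product_url` is made up for illustration, and it assumes the ASIN is the 10-character uppercase alphanumeric segment right after `/dp/`.

```python
import re
from typing import Optional

# Assumption: the ASIN is 10 uppercase letters/digits directly after "/dp/".
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})(?:[/?]|$)")


def canonical_product_url(url: str, domain: str = "www.amazon.in") -> Optional[str]:
    """Return https://<domain>/dp/<asin>, or None if no ASIN is found.

    Hypothetical helper for illustration; `domain` defaults to the
    marketplace used in the question.
    """
    match = ASIN_PATTERN.search(url)
    if match is None:
        return None
    return f"https://{domain}/dp/{match.group(1)}"


long_url = (
    "https://www.amazon.in/BULLMER-Cotton-Printed-T-shirt-Multicolour"
    "/dp/B0892SZX7F/ref=sr_1_4?c=ts&dchild=1"
)
print(canonical_product_url(long_url))
```

The tracking parameters after the ASIN are dropped, so the request URL stays short and stable.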

import requests
from bs4 import BeautifulSoup
url="https://www.amazon.in/dp/B0892SZX7F"
headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}

page = requests.get(url, headers=headers)
print(page.status_code)
soup = BeautifulSoup(page.content, "html.parser")
 
title = soup.find("span", {"id":"productTitle"})
if title:
    title = title.get_text(strip=True)
else:
    title = "default_title"

print(title)

Output:

200
BULLMER Mens Halfsleeve Round Neck Printed Cotton Tshirt - Combo Tshirt - Pack of 3

I tried your code, same error. Kindly check the image: https://imgur.com/a/F77uhj9 — ppxx, Aug 07 '20 at 20:31
@BilalKhan You probably got a captcha page in your HTML, which would explain the 200 status with no title. So it's better to use a proxy. — bigbounty, Aug 07 '20 at 20:32
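A status code of 200 does not tell you whether you got the product page or a captcha page, so it can help to check the body before parsing. The sketch below is only a rough heuristic: the function name `classify_amazon_html` is hypothetical, and the two captcha marker strings are assumptions based on what the block page has typically contained, so they may need updating. It uses plain string checks so it can be run on `page.text` from the code above.

```python
# Assumption: these strings appear on Amazon's captcha/robot-check page.
# They are not documented anywhere official and may change.
CAPTCHA_MARKERS = (
    "/errors/validateCaptcha",
    "Enter the characters you see below",
)


def classify_amazon_html(html: str) -> str:
    """Return "product", "captcha", or "unknown" for a response body."""
    # The element the question is looking for is present: a real product page.
    if 'id="productTitle"' in html:
        return "product"
    # One of the assumed captcha markers is present: the request was blocked.
    if any(marker in html for marker in CAPTCHA_MARKERS):
        return "captcha"
    # Neither: inspect the HTML by hand, e.g. with soup.prettify().
    return "unknown"


sample = '<form method="get" action="/errors/validateCaptcha"></form>'
print(classify_amazon_html(sample))
```

With the code from the question you would call `classify_amazon_html(page.text)` right after `requests.get(...)` and only fall back to `"default_title"` once you know which case you hit.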

ppxx · Answer 2 · 2021-12-11T20:17:39.570

This worked fine for me:

import requests
from bs4 import BeautifulSoup
url="https://www.amazon.in/BULLMER-Cotton-Printed-T-shirt-Multicolour/dp/B0892SZX7F/ref=sr_1_4?c=ts&dchild=1&keywords=Men%27s+T-Shirts&pf_rd_i=1968024031&pf_rd_m=A1VBAL9TL5WCBF&pf_rd_p=8b97601b-3643-402d-866f-95cc6c9f08d4&pf_rd_r=EPY70Y57HP1220DK033Y&pf_rd_s=merchandised-search-6&qid=1596817115&refinements=p_72%3A1318477031&s=apparel&sr=1-4&ts_id=1968123031"
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"}
http_proxy  = "http://10.10.1.10:3128"
https_proxy = "https://10.10.1.11:1080"
ftp_proxy   = "ftp://10.10.1.10:3128"

proxyDict = { 
              "http"  : http_proxy, 
              "https" : https_proxy, 
              "ftp"   : ftp_proxy
            }
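# Pass the proxy mapping to the request; the addresses above are placeholders, so replace them with a working proxy.
page = requests.get(url, headers=headers, proxies=proxyDict)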
print(page.status_code)
soup = BeautifulSoup(page.content, "lxml")
#print(soup.prettify()) 

title = soup.find(id = "productTitle")
if title:
    title = title.get_text()
else:
    title = "default_title"
print(title)
