2

I'm trying to scrape all the HTML elements of a page using requests & beautifulsoup. I'm using ASIN (Amazon Standard Identification Number) to get the product details of a page. My code is as follows:

from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")
print(soup)

But the output doesn't show the entire HTML of the page, so I can't do my further work with product details. Any help on this?

EDIT 1:

From the given answer, I can see that the output is the markup of the bot detection page. I researched a bit and found two ways to get around it:

  1. Add a header to the request, but I couldn't work out what the value of the header should be.
  2. Use Selenium.

Now my question is: do both approaches work equally well?
Proteeti Prova

3 Answers

9

It is easier to use fake_useragent here. Its ua.random property returns a random User-Agent string, chosen according to real-world browser usage statistics. If you don't need dynamic content, you're almost always better off just requesting the page content over HTTP and parsing it programmatically.

import requests
from fake_useragent import UserAgent

# Pick a random real-world browser User-Agent string.
ua = UserAgent()
hdr = {'User-Agent': ua.random,
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
      'Accept-Encoding': 'none',
      'Accept-Language': 'en-US,en;q=0.8',
      'Connection': 'keep-alive'}
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = requests.get(url, headers=hdr)
print(response.content)

Selenium is a browser automation tool; it is the better fit when the content you need is generated dynamically.
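Since fake_useragent is an extra dependency, here is a minimal sketch of building the headers with a fallback to a fixed browser string when the library is missing or fails to load its data. The names FALLBACK_UA, pick_user_agent and build_headers are hypothetical helpers made up for this example, not part of requests or fake_useragent.

```python
# Fixed browser string, used only when fake_useragent is unavailable.
FALLBACK_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
)


def pick_user_agent():
    """Return a random User-Agent if fake_useragent works, else the fallback."""
    try:
        from fake_useragent import UserAgent
        return UserAgent().random
    except Exception:
        # Library not installed, or it could not load its browser data.
        return FALLBACK_UA


def build_headers():
    """Hypothetical helper: headers dict to pass as requests.get(url, headers=...)."""
    return {
        "User-Agent": pick_user_agent(),
        "Accept-Language": "en-US,en;q=0.8",
    }


headers = build_headers()
print(headers["User-Agent"])
```

The resulting dict can be passed as the headers argument of requests.get in the same way as hdr above.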

fdermishin
Mutasim Sadi
3

As some of the comments already suggested, if you need to interact with JavaScript on a page, it is better to use Selenium. However, regarding your first approach using a header:

import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text,"html.parser")

These headers are a bit old, but should still work. By using them you are pretending that your request comes from a normal web browser. If you use requests without such a header, its default User-Agent (python-requests/<version>) tells the server that the request comes from Python, which many servers reject right away.
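The same idea applies to the urlopen call from the question. The sketch below shows what urllib announces by default and how a browser-like header could be attached instead. It sends nothing over the network; build_request and BROWSER_UA are hypothetical names chosen for this example.

```python
from urllib.request import Request, build_opener

BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
)


def build_request(asin, user_agent=BROWSER_UA):
    """Hypothetical helper: a Request for the product page with a custom User-Agent."""
    url = "http://www.amazon.com/dp/" + asin
    return Request(url, headers={"User-Agent": user_agent})


# What urllib sends when no header is set, e.g. "Python-urllib/3.11".
default_ua = dict(build_opener().addheaders)["User-agent"]
print(default_ua)

req = build_request("B004CNH98C")
# urllib stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Passing req to urlopen(req) instead of the bare URL should then send the browser-like header; whether the server accepts it is up to the server.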

Another alternative could be fake-useragent; you may want to give it a try as well.

WurzelseppQX
  • 520
  • 1
  • 6
  • 17
  • 1
    I was confused about whether 'User-Agent' takes any predefined format to give my machine information. I came across this https://developers.whatismybrowser.com/useragents/explore/software_type_specific/web-browser/. I guess this will be the header I pass, am I correct? – Proteeti Prova Aug 29 '18 at 06:46
  • Also, the docs say that custom headers are given less precedence. Does "less precedence" refer to how requests are accepted? – Proteeti Prova Aug 29 '18 at 06:49
  • 1
    From the list of browsers you posted you can select the header you want to use. Your request then pretends to come from this browser. I haven't found the passage about "less precedence", so I can only guess what is meant, but in general servers tend to reject requests that look automated in order to keep performance good. This is why it is necessary to pretend to be a real browser, so that the server accepts your request. – WurzelseppQX Aug 29 '18 at 07:26
  • 2
    However, these days many websites provide APIs for people who want to make automated requests. This is good for both parties: API requests are better for server performance, and for you less code is necessary and it is much more straightforward. So in general I recommend checking whether a site provides an API before trying to parse it the "hacky" way. – WurzelseppQX Aug 29 '18 at 07:30
0

Try this:

import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
r = requests.get(url)
html = r.text

# Option 1: print the raw response text
# print(html)

# Option 2: parse it and print the parsed document
soup = BeautifulSoup(html, "html.parser")
print(soup)
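Before parsing further, it may help to check whether the markup you got back is the product page or the bot detection page mentioned in the question. This is a rough sketch only: the marker strings are assumptions about what such a page might contain, the two sample documents are made up, and it uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Assumed markers; the real wording of a bot detection page may differ.
BOT_MARKERS = ("robot check", "captcha", "automated access")


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def looks_like_bot_page(html):
    """Heuristic: True if any assumed marker occurs in the markup."""
    lowered = html.lower()
    return any(marker in lowered for marker in BOT_MARKERS)


# Made-up sample documents, for illustration only.
sample_blocked = (
    "<html><head><title>Robot Check</title></head>"
    "<body>Enter the characters you see below</body></html>"
)
sample_product = (
    "<html><head><title>Example product</title></head>"
    "<body><span id='productTitle'>Example</span></body></html>"
)

print(page_title(sample_blocked), looks_like_bot_page(sample_blocked))
print(page_title(sample_product), looks_like_bot_page(sample_product))
```

With BeautifulSoup already in use, soup.title gives the same title information; the marker check works on the raw response text either way.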
Bryro