
I'm learning web scraping with Requests and Beautiful Soup on Python 3.

I tried to extract information from different web sites and had no problems.

However, when I visited the packtpub.com site (https://www.packtpub.com/) and sent a request with requests to store the content of the whole site in a variable, I got the following error:

import requests
url = 'https://www.packtpub.com/'
req = requests.get(url)
req.raise_for_status()
reqText = req.text
print(reqText)

"requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.packtpub.com/" 

Later, I searched the site for all Python books and sent a request using the URL of the first page of results: https://search.packtpub.com/?query=python&refinementList%5Breleased%5D%5B0%5D=Available

In this case I didn't get an exception, but I noticed that the content held in the variable was incomplete. Using an element inspector like the one in Mozilla Firefox I could see information about titles, authors, format, etc., but this information was not stored in my variable.
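
For example, here is a rough sketch of the check I did (the CSS class name is only a guess based on what the inspector showed, not necessarily the real one):

import requests
from bs4 import BeautifulSoup

url = 'https://search.packtpub.com/?query=python&refinementList%5Breleased%5D%5B0%5D=Available'
req = requests.get(url)
req.raise_for_status()

soup = BeautifulSoup(req.text, 'html.parser')
# The inspector shows book titles on the rendered page,
# but searching the downloaded HTML does not find them.
titles = soup.find_all(class_='product-title')  # class name is a guess
print(len(titles))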

I thought it was possible to extract information from the public content of any web site.

My questions are: Can companies limit what can be scraped from their sites? Is it always allowed to scrape the public content of web sites, or are there legal issues to take into consideration?

It surprised me that the element inspector shows the whole content, but the requests library doesn't have access to all of it.

dangimar
  • It is good to know that everything that gets downloaded by a browser can be downloaded by a "bot". However, this can be difficult: some websites use a lot of header information/checking to decide whether to serve the correct data, which makes the techniques you are using harder to apply. So in short: it is not possible to limit what users can download, but it can be very difficult to create a valid "fake" request for the data – H.J. Meijer May 28 '18 at 11:37

1 Answer


In this case the website requires a browser-like User-Agent header. By default, requests sends a User-Agent of the form python-requests/x.y.z, which this site rejects (check this post). The following sets the User-Agent header's value to Mozilla:

import requests
url = 'https://www.packtpub.com/'
req = requests.get(url, headers={"User-Agent": "Mozilla"})  # send a browser-like User-Agent
req.raise_for_status()
reqText = req.text
print(reqText)

Note that some websites automatically reject requests with no User-Agent header, or requests whose User-Agent value (such as curl or wget) suggests they come from a bot. Check this guide about preventing web scraping, which helps explain some of the techniques websites use against bots.
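
As a rough sketch (the header values below are only an example of what a browser might send, not something this particular site is known to require), you can send a more complete set of browser-like headers:

import requests

url = 'https://www.packtpub.com/'
# Example browser-like headers; the exact values are only illustrative
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}
req = requests.get(url, headers=headers)
req.raise_for_status()
print(req.text[:500])  # print the beginning of the page to confirm access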

Bertrand Martel