I'm learning web scraping using Requests and Beautiful Soup with Python 3.
I have tried to extract information from different web sites and had no problems.
However, when I visited packtpub.com (https://www.packtpub.com/) and sent a request with Requests in order to store the content of the whole site in a variable, I got the following error:
import requests

url = 'https://www.packtpub.com/'
req = requests.get(url)
req.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
reqText = req.text
print(reqText)
"requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.packtpub.com/"
Later, I searched the site for all Python books and sent a request to the URL of the first page of results: https://search.packtpub.com/?query=python&refinementList%5Breleased%5D%5B0%5D=Available
This time I didn't get an exception, but I noticed that the variable didn't hold all of the content. Using an element inspector like the one in Mozilla Firefox I can see information about titles, authors, formats, and so on, but none of that information was stored in my variable.
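To double-check, I parsed the response with Beautiful Soup and looked for things I can see in the inspector. The static parts of the page are there, but the search results themselves are not, so my guess is that they are filled in by JavaScript after the page loads:

import requests
from bs4 import BeautifulSoup

url = ('https://search.packtpub.com/'
       '?query=python&refinementList%5Breleased%5D%5B0%5D=Available')
req = requests.get(url)
req.raise_for_status()

soup = BeautifulSoup(req.text, 'html.parser')
print(soup.title)                    # the page skeleton is present
print(len(soup.find_all('script')))  # many scripts that appear to build the results
# A book title that is clearly visible in the inspector is absent from the raw HTML
# ('Learning Python' here is just an example of a title I could see):
print('Learning Python' in req.text)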
I had assumed it was possible to extract information from the public content of any web site.
My questions are: can companies limit what can be scraped from their sites? Is it always allowed to scrape the public content of web sites, or are there legal issues to take into consideration?
It surprises me that the element inspector shows me the whole content while the requests library doesn't have access to all of it.