I have never done web scraping before, but now I think it's the only thing that can help me with what I am trying to do. So I looked at some sample code online, and this accepted answer on Stack Overflow seemed to be what I was looking for: Download all pdf files from a website using Python
That wasn't working and was giving me a 403 Forbidden error because, as @Andrej Kesely said, I had to specify the User-Agent header.
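In isolation, his fix boils down to passing a headers dict to requests.get (a minimal sketch; 'Mozilla/5.0' was enough in my case):

import requests

# without a User-Agent the server answers 403 Forbidden;
# with a browser-like one it returned the page normally for me
url = 'http://www.covidmaroc.ma/Pages/LESINFOAR.aspx'
headers = {'User-Agent': 'Mozilla/5.0'}
print(requests.get(url).status_code)                   # 403
print(requests.get(url, headers=headers).status_code)  # 200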
Then I updated the question after his answer:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
# an example of a working url
#url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"
# my url (still not working)
url = 'http://www.covidmaroc.ma/Pages/LESINFOAR.aspx'
# You can use http://httpbin.org/get to see your browser's User-Agent; mine is
# "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36"
headers = {'User-Agent': 'Mozilla/5.0'}
# If there is no such folder, the script will create one automatically
folder_location = 'webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for a in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location, a['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, a['href'])).content)
Now it runs without errors and creates the PDF files. But when I try to open any of the PDFs, they can't be opened in any PDF reader I have; even Chrome says "Error: Failed to load PDF document". Also, the scraped PDFs are only 179 bytes each, whereas the manually downloaded PDFs are 1.XX MB.
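To see what the server actually sends back for those .pdf links, one thing I could check is the first link directly (a quick sketch reusing the url and soup variables from the script above):

first = soup.select_one("a[href$='.pdf']")
resp = requests.get(urljoin(url, first['href']))  # same call as in the download loop (no headers here)
print(resp.status_code)                  # HTTP status of the PDF download itself
print(resp.headers.get('Content-Type'))  # a real PDF should be application/pdf
print(resp.content[:20])                 # a real PDF file starts with b'%PDF-'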