
I have never used web scraping before, but now I think it's the only thing that can help me with what I am trying to do. So I looked for sample code on the internet, and this accepted answer on Stack Overflow seemed to be what I was looking for: Download all pdf files from a website using Python

That wasn't working and was giving me a "403 Forbidden" error because, as @Andrej Kesely said, I had to specify the User-Agent.

Then I updated the question after his answer:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# an example of a working url
#url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"
# my url (still not working)
url = 'http://www.covidmaroc.ma/Pages/LESINFOAR.aspx'

# You can visit http://httpbin.org/get to see the User-Agent your browser sends.
# A short value such as 'Mozilla/5.0' is enough here; my browser's full string is
# "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36"
headers = {'User-Agent': 'Mozilla/5.0'}

# If there is no such folder, the script will create one automatically
folder_location = 'webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for a in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location,a['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,a['href'])).content)

Now it ran without errors and created the PDF files. But none of them can be opened in any PDF reader I have; even Chrome says "Error: Failed to load PDF document". Also, the scraped PDFs are only 179 bytes each, whereas the manually downloaded PDFs are 1.XX MB.
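A quick way to confirm what those 179-byte files actually contain (this check is not from the original post; the helper name is illustrative) is to look at the first bytes of each file. A valid PDF always begins with the magic bytes `%PDF`; a tiny file failing this check is almost certainly an error page saved under a `.pdf` name:

```python
import os

def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic number b'%PDF'."""
    with open(path, "rb") as f:
        return f.read(4) == b"%PDF"

# Example: write a fake "PDF" that is really an HTML error page, then check it.
with open("broken.pdf", "wb") as f:
    f.write(b"<html>403 Forbidden</html>")

print(looks_like_pdf("broken.pdf"))  # False - it's HTML, not a PDF
os.remove("broken.pdf")
```

Running this check over the scraped folder would show whether the server returned real PDFs or small error pages.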

mac179

1 Answer


Try specifying a User-Agent in the request `headers=`:

import requests
from bs4 import BeautifulSoup


url = 'http://www.covidmaroc.ma/Pages/LESINFOAR.aspx'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for a in soup.select("a[href$='.pdf']"):
    print(a['href'])

Prints:

...

/Documents/BULLETIN/BQ_SARS-CoV-2.5.9.20.pdf
/Documents/BULLETIN/BQ_SARS-CoV-2.4.9.20.pdf
/Documents/BULLETIN/BQ_SARS-CoV-2.4.9.20.pdf
/Documents/BULLETIN/BULLETIN%20COVID-19Quotidien_03092020.pdf
/Documents/BULLETIN/BULLETIN%20COVID-19Quotidien_03092020.pdf

EDIT: Also, put headers= into your last requests.get():

...
with open(filename, 'wb') as f:
    f.write(requests.get(urljoin(url,a['href']), headers=headers).content)
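Putting both fixes together, one way to avoid forgetting `headers=` on any individual call is a `requests.Session`, which attaches the headers to every request it makes. This is a sketch assuming the same URL and CSS selector as above, not tested against the live site; the download call is left commented out:

```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def download_pdfs(url, folder='webscraping'):
    """Download every linked .pdf from `url`, sending the same
    User-Agent on the page fetch and on each file fetch."""
    os.makedirs(folder, exist_ok=True)

    # The Session carries these headers on every request it makes,
    # so the PDF downloads can no longer "forget" the User-Agent.
    session = requests.Session()
    session.headers.update({'User-Agent': 'Mozilla/5.0'})

    soup = BeautifulSoup(session.get(url).content, 'html.parser')
    for a in soup.select("a[href$='.pdf']"):
        filename = os.path.join(folder, a['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(session.get(urljoin(url, a['href'])).content)

# download_pdfs('http://www.covidmaroc.ma/Pages/LESINFOAR.aspx')
```

Functionally this is the same as adding `headers=headers` to both `requests.get()` calls; the Session just makes it impossible to miss one.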
Andrej Kesely
  • Hi, thank you for the feedback. I have tried your solution; it ran without errors and created the PDF files. But none of them can be opened in any PDF reader I have; even Chrome says "Error: Failed to load PDF document". Also, the scraped PDFs are only 179 bytes, whereas the manually downloaded PDFs are 1.XX MB. – mac179 Mar 11 '21 at 12:33
  • @AmineChadi Probably you need to supply `headers=` into `f.write(requests.get(urljoin(url,a['href']), headers=headers).content)` as well. – Andrej Kesely Mar 11 '21 at 13:01
  • Yes, thanks a lot, that was really helpful. Please update your answer so I can accept it as the right answer. – mac179 Mar 11 '21 at 13:10