
I ran the code below. Most of it works, but when I run the "for elm in collect" block I get an error: HTTPError: HTTP Error 403: Forbidden. Can anyone help with this? Thanks!!

import requests
from bs4 import BeautifulSoup
import urllib.request
import os


resp = requests.get('https://www.williams.edu/institutional-research/common-data-set/',
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'html5lib')
links = [a['href'] for a in soup.select('li a[href]')]
collect = [] 
for link in links:
    if "https://www.williams.edu/institutional-research/files/" in link:
        collect.append(link)

for elm in collect:
    def main():
        download_file(elm)  # elm is a URL.
    def download_file(download_url):  # download_url receives elm.
        save_path = 'C:/Users/WM' 
        file_name = elm.split("/")[-1]
        complete_name = os.path.join(save_path, file_name)
        response = urllib.request.urlopen(download_url)  
        file = open(complete_name, 'wb') 
        file.write(response.read())
        file.close()
        print("Completed")


    if __name__ == "__main__":
        main()
Oasis101
    I hope stackoverflow.com/questions/16627227/problem-http-error-403-in-python-3-web-scraping will be helpful – sachin Mar 29 '22 at 06:05
    [Problem HTTP error 403 in Python 3 Web Scraping](https://stackoverflow.com/q/16627227) (above link but hyperlinked) – SuperStormer Mar 29 '22 at 06:10
  • Does this answer your question? [Problem HTTP error 403 in Python 3 Web Scraping](https://stackoverflow.com/questions/16627227/problem-http-error-403-in-python-3-web-scraping) – Robert Mar 29 '22 at 14:05

1 Answer


Not sure why your code mixes requests and urllib. Just request the download_url inside the loop, as you do with the initial URL, and add a header:

response = requests.get(download_url, headers={'User-Agent': 'Mozilla/5.0'})  
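
As a side note, requests does not raise an exception for HTTP error statuses on its own, so it can help to check the response before writing it to disk. A minimal sketch using the standard raise_for_status() call:

response = requests.get(download_url, headers={'User-Agent': 'Mozilla/5.0'})
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx instead of saving an error page to disk

Note also that defining main() and download_file() inside the for loop redefines them on every iteration; define them once and call download_file() from the loop, as in the example below.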

Example

import requests
from bs4 import BeautifulSoup
import os


def download_file(download_url):
    save_path = 'C:/Users/WM'
    file_name = download_url.split("/")[-1]
    complete_name = os.path.join(save_path, file_name)
    # Same User-Agent header as the initial request, so the server does not answer 403.
    response = requests.get(download_url, headers={'User-Agent': 'Mozilla/5.0'})
    with open(complete_name, 'wb') as file:
        file.write(response.content)  # requests exposes the body as bytes via .content, not .read()
    print("Completed")


def main():
    resp = requests.get('https://www.williams.edu/institutional-research/common-data-set/',
                        headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(resp.text, 'html5lib')
    links = [a['href'] for a in soup.select('li a[href]')]
    collect = []
    for link in links:
        if "https://www.williams.edu/institutional-research/files/" in link:
            collect.append(link)

    for elm in collect:
        download_file(elm)  # elm is a URL


if __name__ == "__main__":
    main()
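
One caveat: a requests Response object has no read() method (that is urllib's API), so the body has to be written via response.content, as above. If the files are large, an alternative worth considering is a streaming download, so the whole file never sits in memory at once; a rough sketch using requests' documented stream/iter_content options:

def download_file(download_url):
    file_name = download_url.split("/")[-1]
    complete_name = os.path.join('C:/Users/WM', file_name)
    # stream=True fetches the body lazily; iter_content yields it in chunks.
    with requests.get(download_url, headers={'User-Agent': 'Mozilla/5.0'}, stream=True) as response:
        response.raise_for_status()
        with open(complete_name, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
    print("Completed")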
HedgeHog
  • Hi, I tried your code and got another error: AttributeError: 'Response' object has no attribute 'read'. How do I fix it? Thank you! – Oasis101 Mar 29 '22 at 22:44