Python crawler encounters error.HTTPError: HTTP Error 403: Forbidden

Question

I added a User-Agent header to my Python code, but it still fails with the error below. What is the solution? I also added the request headers copied from the browser, and it still does not help. P.S. Opening the page manually in a browser works normally, but when the code sends the request, it gets a 403:

import requests, time, os, urllib.request, socket
from bs4 import BeautifulSoup

def getimg():
    os.system("mkdir Pic")
    headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
           "Accept-Encoding": "gzip, deflate",
           "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7",
           "Cache-Control": "max-age=0",
           "Connection": "keep-alive",
           "Host": "cc.itbb.men",
           "Upgrade-Insecure-Requests": "1",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}
    r = requests.get("http://www.testowne.er/htm_data/8/1804/3099535.html", headers=headers)
    r.encoding = 'GBK'
    soup = BeautifulSoup(r.text, "html.parser")
    iname = 0
    for i in soup.find_all("input", type="image"):
        iname += 1
        i = i['src']
        print(i)
        urllib.request.urlretrieve(i, ".\\Pic\\%s" % str(iname))

========================output==============================================

Traceback (most recent call last):
  File "getimg.py", line 70, in <module>
    getimg()
  File "getimg.py", line 41, in getimg
    urllib.request.urlretrieve(i, ".\\Pic\\%s" % str(iname))
  File "/usr/lib/python3.5/urllib/request.py", line 188, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

score 1 · Accepted Answer · answered Apr 08 '18 at 13:21

As explained in this answer:

This website is blocking the user-agent used by urllib, so you need to change it in your request. Unfortunately I don't think urlretrieve supports this directly.
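To illustrate the point about the user-agent, here is a minimal sketch that uses only the standard library: it builds a urllib.request.Request with a custom User-Agent header and opens it with urlopen() in place of urlretrieve(). The names build_request, download and BROWSER_UA are hypothetical, and whether a given site accepts this particular header value is an assumption.

```python
import urllib.request

# Assumed placeholder value; substitute any browser-like string.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Return a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url: str, dest: str) -> None:
    """Fetch url with the custom User-Agent and write the body to dest."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        data = response.read()
    with open(dest, "wb") as f:
        f.write(data)
```

Calling download(img_url, "img.jpg") would then send the request with the header set above instead of urllib's default one.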

However, using shutil.copyfileobj() to save the file didn't work for me, so I used this instead:

r_img = requests.get(url, stream=True)
if r_img.status_code == 200:
    with open("img.jpg", 'wb') as f:
        f.write(r_img.content)

Full code:

import os

import requests
from bs4 import BeautifulSoup


def download_images(url: str) -> None:
    # Create the output directory; no error if it already exists.
    os.makedirs('Pictures', exist_ok=True)
    r = requests.get(url)
    r.encoding = 'GBK'
    soup = BeautifulSoup(r.text, 'html.parser')

    for i, img in enumerate(soup.find_all('input', type='image')):
        img_url = img['src']
        print(i, img_url)
        r_img = requests.get(img_url, stream=True)
        if r_img.status_code == 200:
            with open(f'Pictures/pic{i}.jpg', 'wb') as f:
                f.write(r_img.content)


download_images('http://cc.itbb.men/htm_data/8/1804/3099535.html')

Note the use of an f-string to format the path. F-strings are available in Python 3.6+; on an older version of Python, switch to either % formatting or .format(). The type hints I added to the function signature are a Python 3.5+ feature, and you can omit them on older versions.
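For reference, the three ways of building the path compare as follows. This is a small sketch in which the value of i is an arbitrary example standing in for the loop counter:

```python
i = 3  # example index, standing in for the loop counter

path_fstring = f'Pictures/pic{i}.jpg'         # f-string, Python 3.6+
path_format = 'Pictures/pic{}.jpg'.format(i)  # str.format()
path_percent = 'Pictures/pic%d.jpg' % i       # %-formatting

# All three produce the same string.
print(path_fstring, path_format, path_percent)
```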
