Can't get the true html with python requests

Question

When I use Python's urllib.request to fetch a URL, I get a 403 Forbidden. Here's the code:

import urllib.request
url='https://www.genecards.org/cgi-bin/carddisp.pl?gene=ERBB2'
headers=('User-Agent','Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')
opener=urllib.request.build_opener()
opener.addheaders=[headers]
urllib.request.install_opener(opener)
data=urllib.request.urlopen(url).read().decode('utf-8')
print(data)

Then I get an error:

Traceback (most recent call last):
  File "/Users/zhangqing/Documents/Yanpu/ERBB2 Gene.py", line 22, in <module>
    data=urllib.request.urlopen(url).read().decode('utf-8')
  File "/Users/zhangqing/anaconda/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/zhangqing/anaconda/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/Users/zhangqing/anaconda/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/zhangqing/anaconda/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/Users/zhangqing/anaconda/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/Users/zhangqing/anaconda/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

When I try the requests library instead, the code is:

import requests
import re
from requests.exceptions import RequestException

def get_page(url):
    headers={'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    try:
        res=requests.get(url,headers=headers)
        # Return the body only on success; any other status falls through to None
        if res.status_code==200:
            return res.text
    except RequestException:
        return None

html=get_page('https://www.genecards.org/cgi-bin/carddisp.pl?gene=ERBB2')
print(html)

I get HTML like this:

Request unsuccessful. Incapsula incident ID: 461001240193404751-556133389381208009

It is not the real source code of the webpage. So what should I do to improve the code?

The first code you posted does not work: `"AttributeError: module 'urllib' has no attribute 'request'"` – Baptiste Mille-Mathias Aug 02 '18 at 03:17

score 1 · Answer 1 · answered Aug 02 '18 at 03:34

The web page is protected by Incapsula, and Incapsula has detected that the request comes from a bot. See this question for some possible workarounds, or try to find a public API for genecards.org, if there is one.
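Note that the block page in the question came back through the `status_code==200` branch, so checking the status alone is not enough to tell a real page from a blocked one. As a minimal sketch, assuming the block page always contains the "Incapsula incident ID" text quoted in the question, the response body could be checked like this (the name `incapsula_incident_id` is made up for illustration and is not part of any library):

```python
import re

# Pattern taken from the block page quoted in the question; other
# Incapsula responses may be worded differently (assumption).
INCAPSULA_PATTERN = re.compile(r"Incapsula incident ID:\s*([\d-]+)")


def incapsula_incident_id(body):
    """Return the incident ID if body looks like an Incapsula block page, else None."""
    if not body:
        return None
    match = INCAPSULA_PATTERN.search(body)
    return match.group(1) if match else None


sample = "Request unsuccessful. Incapsula incident ID: 461001240193404751-556133389381208009"
print(incapsula_incident_id(sample))
```

Calling this on the text returned by `get_page` before parsing it would at least make the failure explicit instead of silently treating the block page as the real HTML.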

Jack Taylor

Thanks for your comment! I have improved my code. However, I can only get part of the HTML (from line 1581 to line 7535). – QqZhang Aug 02 '18 at 04:10
Try the requests with proxies. That might help, I guess. – SanthoshSolomon Aug 02 '18 at 05:00
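
For reference, a rough sketch of what the proxy suggestion could look like with the requests library. The proxy address below is a placeholder, not a real service, and `make_session` is a hypothetical helper name. Whether a proxy gets past Incapsula at all is not something this thread confirms.

```python
import requests

# Hypothetical proxy address; replace it with a proxy you are allowed to use.
PROXIES = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}


def make_session(user_agent, proxies=PROXIES):
    """Build a requests.Session that sends the given User-Agent through the proxies."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    session.proxies.update(proxies)
    return session
```

The session would then be used in place of the plain `requests.get` call, for example `make_session('Mozilla/5.0 ...').get(url, timeout=10)`.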
