
I'm using urllib.request.urlopen to query the URL http://dblp.org/db/conf/lak/index. I cannot access the site with urllib because the request fails with the following HTTP status code error:

HTTPError: HTTP Error 406: Not Acceptable

Here is the code that I'm using to make this request:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://dblp.org/db/conf/lak/index'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

I'm unsure what is causing this error and would appreciate help resolving it.

Below is the Stack Trace related to this error:

HTTPError                                 Traceback (most recent call last)
<ipython-input-5-b158a1e893a0> in <module>
----> 1 html = urlopen("https://dblp.org/db").read()
      2 #print(html)
      3 soup = BeautifulSoup(html)
      4 soup.prettify()

~\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223 
    224 def install_opener(opener):

~\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532 
    533         return response

~\Anaconda3\lib\urllib\request.py in http_response(self, request, response)
    639         if not (200 <= code < 300):
    640             response = self.parent.error(
--> 641                 'http', request, response, code, msg, hdrs)
    642 
    643         return response

~\Anaconda3\lib\urllib\request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570 
    571 # XXX probably also want an abstract factory that knows when it makes

~\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

~\Anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 406: Not Acceptable
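Incidentally, the response body that a 406 comes with can be read straight from the exception: urllib's HTTPError doubles as a file-like response object, so its status, headers, and body are all inspectable. A minimal diagnostic sketch (using the URL from the question; the connection handling is illustrative):

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

url = 'http://dblp.org/db/conf/lak/index'  # URL from the question
try:
    html = urlopen(url).read()
    print('OK, got', len(html), 'bytes')
except HTTPError as e:
    # HTTPError is also a file-like response: status, headers, body
    print('Status:', e.code, e.reason)
    print('Content-Type:', e.headers.get('Content-Type'))
    print('Body starts with:', e.read()[:120])
except URLError as e:
    print('Connection problem:', e.reason)
```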
  • It would be helpful if you provide the URL of the website – Sushil Oct 06 '20 at 18:13
  • @Sushil This is the url https://dblp.org/db/conf/lak/index – Sri Test Oct 06 '20 at 18:23
  • Probably the http headers are wrong: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/406#:~:text=The%20HyperText%20Transfer%20Protocol%20(HTTP,to%20supply%20a%20default%20representation. What's the body of the response? Try to use the same headers the browser sends. – Thomas Sablik Oct 06 '20 at 18:24
  • @ThomasSablik When I open this normally it works fine, but when I request it from my code it throws a 406 error. I already know what a 406 error is, but I don't know how to resolve it in this case. – Sri Test Oct 06 '20 at 18:26
  • Fix your headers. You are either sending wrong headers or missing obligatory headers. – Thomas Sablik Oct 06 '20 at 18:26
  • @ThomasSablik If you have already taken a look at my code, it is nothing complex with headers and all. It is just a simple BeautifulSoup script. – Sri Test Oct 06 '20 at 18:28
  • Then you have to add some headers. What is the content of the response body? – Thomas Sablik Oct 06 '20 at 18:29
  • Which is why it's not working. You need to mimic the headers that the website needs. If you don't know what those are, run a tool like WireShark or Telerik Fiddler to see what headers the website is actually using. – Robert Harvey Oct 06 '20 at 18:29
  • @RobertHarvey I need to solve this programmatically. Could you give a programmatic approach, since I have already mentioned the link? – Sri Test Oct 06 '20 at 18:32
  • @ThomasSablik It is HTML source code – Sri Test Oct 06 '20 at 18:33
  • Set all the headers your browser sends. Some headers are necessary. Please post the response body. – Thomas Sablik Oct 06 '20 at 18:33
  • Have you first identified the headers the site needs, as I explained before? See [here](https://stackoverflow.com/a/43441551/102937) for information on how to set the headers. – Robert Harvey Oct 06 '20 at 18:36
  • Please edit critical information into the question; don't leave it dangling in the comments. – Prune Oct 06 '20 at 18:37
  • @RobertHarvey Since I am using urlopen, how do I set my headers there? Could you find any way? – Sri Test Oct 06 '20 at 18:40
  • [How do I set headers using python's urllib?](https://stackoverflow.com/q/7933417/102937) – Robert Harvey Oct 06 '20 at 18:42

1 Answer


A 406 Not Acceptable response means the server cannot (or will not) produce content matching the headers the client sent. urllib sends a default Python-urllib User-Agent and a bare Accept header, which this server rejects; sending browser-like headers avoids the error.

I don't get this error when using Python Requests, which lets you pass the headers directly:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'}
raw_html = requests.get('http://dblp.org/db/conf/lak/index', headers=headers)
soup = BeautifulSoup(raw_html.content, 'html.parser')
print(soup.prettify())

The version below keeps urlopen and sets the same headers on a Request object; with them, the 406 error does not occur.

from urllib.request import Request
from urllib.request import urlopen
from bs4 import BeautifulSoup

raw_request = Request('https://dblp.org/db/conf/lak/index')
raw_request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0')
raw_request.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
resp = urlopen(raw_request)
raw_html = resp.read()
soup = BeautifulSoup(raw_html, 'html.parser')
print(soup.prettify())
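If you make many requests, a sketch of an alternative is to set the headers once on an opener instead of on every Request; build_opener and install_opener are standard urllib.request functions, and the header values here are copied from the code above:

```python
from urllib.request import build_opener, install_opener

opener = build_opener()
# addheaders is a list of (name, value) pairs attached to every request
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]
install_opener(opener)  # plain urlopen() calls now send these headers
```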

– Life is complex
  • could you also take a look at my other question. https://stackoverflow.com/questions/64231413/how-to-get-papers-from-semantic-scholar-api-using-paper-title/64231785?noredirect=1#comment113582539_64231785 – Sri Test Oct 07 '20 at 12:10