When I look up my issue on Google or Stack Overflow, there seem to be half a dozen solved cases like this, but I never really understand the solutions.

So I want to scrape a .csv file from a server using JupyterLab, launched through Anaconda. The file definitely exists, and I can download it manually with a few clicks.

Now I try to execute the following code:

import pandas as pd
pd.read_csv("https://first-python-notebook.readthedocs.io/_static/committees.csv")

It produces the following error:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-37-aae59f2238c3> in <module>
----> 1 pd.read_csv("https://first-python-notebook.readthedocs.io/_static/committees.csv")
/Applications/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675 
--> 676         return _read(filepath_or_buffer, kwds)
    677 
    678     parser_f.__name__ = name
/Applications/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    429     # See https://github.com/python/mypy/issues/1297
    430     fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
--> 431         filepath_or_buffer, encoding, compression
    432     )
    433     kwds["compression"] = compression
/Applications/anaconda3/lib/python3.7/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
    170 
    171     if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
--> 172         req = urlopen(filepath_or_buffer)
    173         content_encoding = req.headers.get("Content-Encoding", None)
    174         if content_encoding == "gzip":
/Applications/anaconda3/lib/python3.7/site-packages/pandas/io/common.py in urlopen(*args, **kwargs)
    139     import urllib.request
    140 
--> 141     return urllib.request.urlopen(*args, **kwargs)
    142 
    143 
/Applications/anaconda3/lib/python3.7/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223 
    224 def install_opener(opener):
/Applications/anaconda3/lib/python3.7/urllib/request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532 
    533         return response
/Applications/anaconda3/lib/python3.7/urllib/request.py in http_response(self, request, response)
    639         if not (200 <= code < 300):
    640             response = self.parent.error(
--> 641                 'http', request, response, code, msg, hdrs)
    642 
    643         return response
/Applications/anaconda3/lib/python3.7/urllib/request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570 
    571 # XXX probably also want an abstract factory that knows when it makes
/Applications/anaconda3/lib/python3.7/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result
/Applications/anaconda3/lib/python3.7/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden

What works, though, is when I try this instead:

import requests

link = "https://first-python-notebook.readthedocs.io/_static/committees.csv"
f = requests.get(link)
print(f.text)

From reading other resources, it seems the issue could be that my user agent is not set correctly, which makes the target server reject my request. The suggested solution is to add a proper (or fake) header that includes my user agent: https://www.whatismybrowser.com/detect/what-is-my-user-agent
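
A quick way to see the difference is to print the default User-Agent each library sends (a sketch; the exact version strings depend on the installed versions):

import requests
import urllib.request

# requests identifies itself as something like "python-requests/2.23.0"
print(requests.utils.default_headers()['User-Agent'])

# urllib's default opener sends "Python-urllib/3.x", which some servers
# reject outright with 403 Forbidden
print(dict(urllib.request.OpenerDirector().addheaders)['User-agent'])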

So I tried this:

import http.cookiejar
from urllib.request import urlopen

site = "https://first-python-notebook.readthedocs.io/_static/committees.csv"
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
req = urllib2.Request(site, headers=hdr)
page = urlopen(req)
content = page.read()
print(content)

But first of all, it returns

NameError: name 'urllib2' is not defined

...which I can't find a working solution for.

Of course my main issue remains unsolved as well.

I don't really understand where my header is supposed to be set. Do I need to run something like this for every file I fetch from the web? Isn't there a more general solution? Or is this even the actual problem I have?

– nextbear

2 Answers


Since pandas 1.2, it is possible to tune the underlying reader by passing options as dictionary keys to the storage_options parameter of read_csv. So by invoking it with

import pandas as pd

# the URL from the question; for http(s) URLs, storage_options entries
# are sent along as HTTP request headers
url = 'https://first-python-notebook.readthedocs.io/_static/committees.csv'
storage_options = {'User-Agent': 'Mozilla/5.0'}
df = pd.read_csv(url, storage_options=storage_options)

the library will include the User-Agent header in the request, so you don't have to set it up externally before invoking read_csv.
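
If you are on a pandas version older than 1.2, where storage_options is not available, a similar effect can be had by downloading the file with requests (which the question already shows gets through) and wrapping the text in a buffer. A minimal sketch:

import io

import pandas as pd
import requests

url = 'https://first-python-notebook.readthedocs.io/_static/committees.csv'

# requests sends a "python-requests/..." User-Agent, which this server accepts
response = requests.get(url)
response.raise_for_status()

# read_csv accepts any file-like object, so wrap the downloaded text
df = pd.read_csv(io.StringIO(response.text))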

– Oliver Schupp

This script should work with both Python 2 and Python 3 (in Python 3, urllib2 was merged into urllib.request):

import pandas as pd

try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

# build the request with a browser-like User-Agent so the server
# doesn't reject it with 403 Forbidden
req = Request('<YOUR URL WITH CSV>')
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0')
content = urlopen(req)

# urlopen returns a file-like object, which read_csv accepts directly
df = pd.read_csv(content)
print(df)
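
If you need this for many files, one option is to wrap the recipe in a small helper (a sketch; the name read_csv_ua is made up here, and the default User-Agent is just an example):

import pandas as pd
from urllib.request import Request, urlopen  # Python 3

def read_csv_ua(url, user_agent='Mozilla/5.0', **kwargs):
    """Read a CSV from a URL, sending a browser-like User-Agent header."""
    req = Request(url)
    req.add_header('User-Agent', user_agent)
    # urlopen returns a file-like object, which read_csv accepts
    return pd.read_csv(urlopen(req), **kwargs)

df = read_csv_ua('https://first-python-notebook.readthedocs.io/_static/committees.csv')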
– Andrej Kesely
    Great, that worked! :) Now let me know why, please :) – nextbear Jun 09 '20 at 09:03
  • @nextbear `urllib2` doesn't exist in Python 3, so you get an error importing this module. – Andrej Kesely Jun 09 '20 at 09:33
  • Okay, but why do we need to go back to Python 2 in the first place to make pd.read_csv work? I am working through this tutorial and it assumes Python 3: https://www.firstpythonnotebook.org/dataframe/index.html – nextbear Jun 09 '20 at 18:03
  • @nextbear I'm not going back to Python 2. The code I've posted is just compatible with Python 2/3. – Andrej Kesely Jun 09 '20 at 18:04
  • Ok, I misunderstood. Is there a way to make pd.read_csv work for me without using urllib2 or anything else that doesn't exist in Python 3? – nextbear Jun 09 '20 at 18:34
  • @nextbear If you're using Python 3, the code I posted uses `urllib.request`, which is in the standard library: https://docs.python.org/3/library/urllib.request.html – Andrej Kesely Jun 09 '20 at 18:38
  • I see. Ok, I need to use this code every time I want to scrape a file from the web with Jupyter? Seems like quite a hassle. Are there other options? – nextbear Jun 10 '20 at 08:11
  • Been a long time, but in case @nextbear or anyone is still trying to follow this: you don't *always* need this urllib incantation for pd.read_csv()! This is just a workaround for some web servers that are configured to respond only to requests that look like they came from a browser. – zgana Feb 17 '21 at 04:06