The pandas.read_csv function is very flexible and has recently begun to support URL inputs, as described here:

df = pd.read_csv('http://www.somefile.csv')

I am attempting to find in the source code where this case is handled. Here is what I know so far:

1) read_csv is a fairly generic wrapper generated by _make_parser_function within io/parsers.py

2) The function produced by _make_parser_function delegates the reading of the data to a function _read(filepath_or_buffer, kwds) that is defined elsewhere in io/parsers.py

3) This function _read(filepath_or_buffer, kwds) creates a TextFileReader and returns the result of TextFileReader.read(). However, it appears that TextFileReader is responsible for, well, only text files. It provides functionality for handling various types of compression, but I see nothing checking for URL input.

4) On the other hand, io/html.py contains a function _read(obj) that is clearly meant to access a URL and return the result of an HTTP query.

It seems to me that the simple solution would be to check whether the input string is a URL and, if so, dispatch to the html module; however, I cannot find where this happens when tracing through read_csv. Could anybody point me in the right direction?

AJ Pryor

1 Answer

You missed a step between 2 and 3.

2.5) _read calls get_filepath_or_buffer(), which recognizes the URL as such and reads it.

filepath_or_buffer, _, compression = get_filepath_or_buffer(
    filepath_or_buffer, encoding, compression)

get_filepath_or_buffer() is defined in pandas.io.common:

def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
                           compression=None):
    """
    If the filepath_or_buffer is a url, translate and return the buffer.
    Otherwise passthrough.
    Parameters
    ----------
    filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path),
                         or buffer
    encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
    Returns
    -------
    a filepath_or_buffer, the encoding, the compression
    """
    ...
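
To make the idea concrete, here is a minimal sketch of the kind of check-and-dispatch that the docstring describes. It is not the pandas source: the names looks_like_url and resolve_input, and the set of accepted schemes, are assumptions made up for illustration, and the real function also deals with encoding and compression.

```python
from io import BytesIO
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumption: only these schemes count as "a URL" in this sketch.
_URL_SCHEMES = {"http", "https", "ftp"}


def looks_like_url(obj):
    """Hypothetical stand-in for a URL check such as pandas' _is_url."""
    if not isinstance(obj, str):
        # Buffers and path objects are never treated as URLs here.
        return False
    return urlparse(obj).scheme in _URL_SCHEMES


def resolve_input(filepath_or_buffer):
    """If the input is a URL, fetch it and return an in-memory buffer.

    Anything else (a local path or an existing buffer) is passed through
    unchanged, mirroring the "Otherwise passthrough" wording above.
    """
    if looks_like_url(filepath_or_buffer):
        with urlopen(filepath_or_buffer) as response:
            return BytesIO(response.read())
    return filepath_or_buffer
```

With something like this in place, the downstream reader only ever sees a path or a buffer, which would explain why TextFileReader itself contains no URL handling.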
Steven Rumbalski
  • Thanks -- the relevant line is literally the second line of code within the function you pointed out after the docstring `if _is_url(filepath_or_buffer):` – AJ Pryor Aug 16 '17 at 19:24