
I am interested in streaming a custom object into a pandas DataFrame. According to the documentation, any object with a read() method can be used. However, even after implementing this method I am still getting this error:

ValueError: Invalid file path or buffer object type: <class '__main__.DataFile'>

Here is a simple version of the object, and how I am calling it:

class DataFile(object):
    def __init__(self, files):
        self.files = files

    def read(self):
        for file_name in self.files:
            with open(file_name, 'r') as file:
                for line in file:
                    yield line

import pandas as pd
hours = ['file1.csv', 'file2.csv', 'file3.csv']

data = DataFile(hours)
df = pd.read_csv(data)

Am I missing something, or is it just not possible to use a custom generator in pandas? Calling the read() method directly works just fine.

EDIT: The reason I want to use a custom object rather than concatenating the dataframes together is to see if it is possible to reduce memory usage. I have used the gensim library in the past, and it makes it really easy to use custom data objects, so I was hoping to find some similar approach.

  • Even if this worked as documented, I doubt your `read` method would work, because generally, `read(x)` reads *x bytes from the buffer*. Instead, your `read` method returns a generator object. – juanpa.arrivillaga Sep 19 '17 at 21:34
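
For comparison, a built-in file object's read method takes an optional size argument and returns data rather than a generator (illustrated here with file1.csv from the question):

with open('file1.csv') as f:
    first_five = f.read(5)  # returns at most 5 characters as a string
    rest = f.read()         # returns everything remaining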

3 Answers


One way to make a file-like object in Python 3 is by subclassing io.RawIOBase. Then, using Mechanical snail's iterstream, you can convert any iterable of bytestrings into a file-like object:

import tempfile
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)


class DataFile(object):
    def __init__(self, files):
        self.files = files

    def read(self):
        for file_name in self.files:
            with open(file_name, 'rb') as f:
                for line in f:
                    yield line

def make_files(num):
    filenames = []
    for i in range(num):
        with tempfile.NamedTemporaryFile(mode='wb', delete=False) as f:
            f.write(b'''1,2,3\n4,5,6\n''')
            filenames.append(f.name)
    return filenames

# hours = ['file1.csv', 'file2.csv', 'file3.csv']
hours = make_files(3)
print(hours)
data = DataFile(hours)
df = pd.read_csv(iterstream(data.read()), header=None)

print(df)

prints

   0  1  2
0  1  2  3
1  4  5  6
2  1  2  3
3  4  5  6
4  1  2  3
5  4  5  6
– unutbu

The documentation mentions the read method, but pandas actually checks whether the argument is file-like with its is_file_like helper (that's where the exception is thrown). That function is actually very simple:

def is_file_like(obj):
    if not (hasattr(obj, 'read') or hasattr(obj, 'write')):
        return False
    if not hasattr(obj, "__iter__"):
        return False
    return True

So it also needs an __iter__ method.
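
You can check this directly (assuming pandas >= 0.20, where the helper is publicly re-exported as pandas.api.types.is_file_like):

import io
import pandas as pd

# io.StringIO has read/write and __iter__, so it passes the check;
# a plain list iterates but has no read/write, so it fails.
print(pd.api.types.is_file_like(io.StringIO("a,b\n1,2\n")))  # True
print(pd.api.types.is_file_like(["not", "a", "file"]))       # False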

But that's not the only problem. pandas requires that the object actually behave like a file, so the read method should accept an argument for the number of bytes to read (which means read can't be a generator function: it has to return a string, not a generator object).

So for example:

class DataFile(object):
    def __init__(self, files):
        # `files` is ignored here; the data is inlined for demonstration.
        self.data = """a b
1 2
2 3
"""
        self.pos = 0

    def read(self, x):
        # Return the next `x` characters, like a real file object would.
        nxt = self.pos + x
        ret = self.data[self.pos:nxt]
        self.pos = nxt
        return ret

    def __iter__(self):
        yield from self.data.split('\n')

will be recognized as valid input.
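
For example (a usage sketch: the files argument is ignored by this simplified class, and the inlined data is space-separated):

import pandas as pd

df = pd.read_csv(DataFile([]), sep=' ')
print(df)
#    a  b
# 0  1  2
# 1  2  3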

However, it's harder for multiple files. I hoped that fileinput would have some appropriate routine, but it doesn't seem to:

import fileinput

pd.read_csv(fileinput.input([...]))
# ValueError: Invalid file path or buffer object type: <class 'fileinput.FileInput'>
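
One workaround (my own sketch, not part of this answer or of fileinput) is to buffer lines from a generator behind a size-aware read method, so that a multi-file wrapper passes the is_file_like check. This assumes the files can be concatenated as-is (e.g. they have no headers, or you pass header=None):

import pandas as pd

class MultiCSV(object):
    """Hypothetical wrapper presenting several text files as one stream."""
    def __init__(self, files):
        self._lines = self._iter_lines(files)
        self._buf = ''

    @staticmethod
    def _iter_lines(files):
        for name in files:
            with open(name) as f:
                for line in f:
                    yield line

    def read(self, size=-1):
        # Accumulate lines until `size` characters are buffered (or EOF),
        # then return exactly that many, like a real file object.
        while size < 0 or len(self._buf) < size:
            try:
                self._buf += next(self._lines)
            except StopIteration:
                break
        if size < 0:
            size = len(self._buf)
        out, self._buf = self._buf[:size], self._buf[size:]
        return out

    def __iter__(self):
        # Only needed to satisfy the is_file_like check above.
        return self._lines

df = pd.read_csv(MultiCSV(['file1.csv', 'file2.csv', 'file3.csv']),
                 header=None)
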
– MSeifert
  • I couldn't find the `is_file_like` function, because the import statement implied it would be at `pandas.core.dtypes.common`, but it was in `pandas.core.dtypes.inference` ... weird... Anyway, looking at the actual csv parsing code, I *believe* that it uses `.readline` if you pass `engine='python'` – juanpa.arrivillaga Sep 19 '17 at 21:57
  • 1
    Gah! And look what I found in `pandas.core.dtypes.common`: `from .inference import * # noqa` Yes, no quality-assurance indeed... – juanpa.arrivillaga Sep 19 '17 at 22:01
  • 1
    hm, I just used `pd.io.common.is_file_like.__module__` (the `pd.io.common` was the module where the function was called). :) – MSeifert Sep 19 '17 at 22:06

How about this alternative approach:

def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

df = get_merged_csv(hours)
– MaxU - stand with Ukraine