I have a proprietary cursor object (arcpy.da.SearchCursor) that I need to load into a pandas dataframe.

It implements next() and reset(), as you would expect from a generator-like object in Python.

Using another post on Stack Exchange, which is brilliant, I created a class that makes the cursor act like a file-like object. This works for the default case, where chunksize is not set, but when I set a chunk size for each dataframe, it crashes Python.

My guess is that read(n) needs to be implemented so that n rows are returned, but so far my attempts have been wrong.

What is the proper way to implement my class so I can use a generator to load a dataframe? I need to use chunksize because my datasets are huge.

So the pseudo code would be:

customfileobject = Reader(cursor)
dfs = pd.read_csv(customfileobject, columns=cursor.fields,
                  chunksize=10000)

I am using Pandas version 0.16.1 and Python 2.7.10.

Class below:

class Reader(object):

    """allows a cursor object to be read like a filebuffer"""
    def __init__(self, fc=None, columns="*", cursor=None):
        if cursor or fc:
            if fc:
                self.g = arcpy.da.SearchCursor(fc, columns)
            else:
                self.g = cursor
        else:
            raise ValueError("You must provide a da.SearchCursor or table path and column names")
    def read(self, n=0):
        try:
            vals = []
            if n == 0:
                return next(self.g)
            else:
                # return multiple rows?
                for x in range(n):
                    try:
                        vals.append(next(self.g))
                    except StopIteration:
                        return ''
        except StopIteration:
            return ''

    def reset(self):
        self.g.reset()
code base 5000

1 Answer

Try the following read function:

def read(self, n=0):
    # Each cursor row is a tuple, so serialize it to a CSV line before
    # handing it to pandas; an empty string signals end-of-file.
    if n == 0:
        try:
            return ','.join(map(str, next(self.g))) + '\n'
        except StopIteration:
            return ''
    else:
        vals = []
        try:
            for x in range(n):
                vals.append(','.join(map(str, next(self.g))) + '\n')
        except StopIteration:
            pass
        return ''.join(vals)

You should tell pd.read_csv the column names using the names argument (not columns), and that you have no header row (header=None).
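Putting it together, here is a minimal, self-contained sketch of the whole pattern. Since arcpy is proprietary, a plain list of tuples stands in for the SearchCursor, and the column names ('oid', 'value') are made up for illustration; with a real cursor you would pass names=cursor.fields.

```python
import pandas as pd

class Reader(object):
    """Wraps a row iterator so pandas can read it like a file buffer."""
    def __init__(self, cursor):
        self.g = iter(cursor)

    def read(self, n=0):
        # Serialize each row to a CSV line; read_csv keeps calling
        # read() until it receives an empty string (end-of-file).
        if n == 0:
            try:
                return ','.join(map(str, next(self.g))) + '\n'
            except StopIteration:
                return ''
        vals = []
        try:
            for _ in range(n):
                vals.append(','.join(map(str, next(self.g))) + '\n')
        except StopIteration:
            pass
        return ''.join(vals)

# A plain list of tuples stands in for the arcpy.da.SearchCursor.
rows = [(i, i * 2) for i in range(25)]
chunks = list(pd.read_csv(Reader(rows), names=['oid', 'value'],
                          header=None, chunksize=10))
# 25 rows with chunksize=10 yields dataframes of 10, 10, and 5 rows.
```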

Alicia Garcia-Raboso