45

I've created a tuple generator that extracts information from a file, keeping only the records of interest and converting each one to a tuple that the generator yields.

I tried to create a DataFrame from it:

import pandas as pd
df = pd.DataFrame.from_records(tuple_generator, columns = tuple_fields_name_list)

but it throws an error:

... 
C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
   1046                 values.append(row)
   1047                 i += 1
-> 1048                 if i >= nrows:
   1049                     break
   1050 

TypeError: unorderable types: int() >= NoneType()

I got it to work by consuming the generator into a list, but that uses twice the memory:

df = pd.DataFrame.from_records(list(tuple_generator), columns = tuple_fields_name_list)

The files I want to load are big, and memory consumption matters. On my last attempt, my computer spent two hours trying to grow virtual memory :(

The question: does anyone know a way to create a DataFrame directly from a record generator, without first converting it to a list?

Note: I'm using python 3.3 and pandas 0.12 with Anaconda on Windows.

Update:

The problem is not reading the file; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, and yields the fields as tuples. Some numbers: it scans 2111412 records in a 130MB gzip file, about 6.5GB uncompressed, in about a minute and with little memory used.

Pandas 0.12 does not accept generators. The dev version does, but it puts the whole generator into a list and then converts that to a frame. That is not efficient, but it is something pandas has to deal with internally. Meanwhile, I have to think about buying some more memory.

sophros
tinproject
  • The problem must be in `tuple_generator`, since the problem does not occur for simple generator expressions like `tuple_generator = (item for item in [[1,2,3],[2,3,4,5]])`. – unutbu Sep 20 '13 at 11:58
  • @unutbu Not on pandas 0.12. On the development version it works correctly. – Viktor Kerkez Sep 20 '13 at 12:02
  • It sounds like you might be experiencing [thrashing](http://en.wikipedia.org/wiki/Thrashing_(computer_science)), in which case you should consider adding more memory to your machine. – Phillip Cloud Sep 20 '13 at 12:24

6 Answers

31

You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:

import pandas as pd
someGenerator = ( (x, chr(x)) for x in range(48,127) )
someDf = pd.DataFrame(someGenerator)

Produces:

type(someDf) #pandas.core.frame.DataFrame

someDf.dtypes
#0     int64
#1    object
#dtype: object

someDf.tail(10)
#      0  1
#69  117  u
#70  118  v
#71  119  w
#72  120  x
#73  121  y
#74  122  z
#75  123  {
#76  124  |
#77  125  }
#78  126  ~
Guilherme David da Costa
C8H10N4O2
  • This question dates from when pandas did not allow generators at all (pre 0.13). – tinproject Apr 28 '17 at 11:12
  • Using `.from_records()` is correct for the question's use case, since it takes a generator of records. With the default constructor it is not clear how the generator will be interpreted: as a generator of records or as a generator of columns (series). – tinproject Apr 28 '17 at 11:22
  • It took a little creativity to work with CSV lines, but in case anyone else comes across the same issue, within my generator I used `for line in lines: yield next(csv.reader([line]))`. This was useful for me because I needed to perform some cleansing on each line and had other conditional logic to worry about within the CSV. – DarkHark May 25 '21 at 16:13
  • Dear @c8h10n4o2 please explain why one should choose using generator in this case? – SteveS May 13 '22 at 10:22
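For readers on a recent pandas, here is a minimal sketch of both routes discussed in the comments above. The generator `records()` and the column names are hypothetical stand-ins for the question's `tuple_generator` and `tuple_fields_name_list`. Passing `columns=` makes it explicit that each yielded tuple is one row. As far as I know, both calls still collect the records into a list internally, so this shows the API, not a memory saving.

```python
import pandas as pd

def records():
    """Hypothetical stand-in for the question's tuple generator."""
    for x in range(48, 58):
        yield (x, chr(x))

columns = ["code", "char"]  # hypothetical field names

# Constructor: each tuple yielded by the generator becomes one row.
df_all = pd.DataFrame(records(), columns=columns)

# from_records: nrows limits how many records are pulled from the iterator.
df_head = pd.DataFrame.from_records(records(), columns=columns, nrows=3)

print(df_all.shape)
print(df_head)
```

A generator can only be consumed once, which is why `records()` is called again for the second frame.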
20

You cannot create a DataFrame from a generator with pandas 0.12. You can update to the development version (get it from GitHub and compile it, which is a little painful on Windows, but this is the option I would prefer).

Or, since you said you are filtering the lines, you can filter them first, write them to a file, and then load them using read_csv or something similar.
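The filter-then-reload route could look roughly like this. It is only a sketch: `records()` and the column names are made-up stand-ins for the filtered tuple generator, and a temporary file is used just to keep the example self-contained.

```python
import csv
import os
import tempfile

import pandas as pd

def records():
    """Hypothetical stand-in for the filtered tuple generator."""
    yield ("foo", 1, 0.5)
    yield ("bar", 2, 1.5)
    yield ("baz", 3, 2.5)

columns = ["name", "count", "score"]  # hypothetical field names

# Stream the records to disk; writerows pulls them from the generator
# one at a time, so the full data set is never held as Python objects.
with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv", delete=False) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(columns)
    writer.writerows(records())
    path = tmp.name

try:
    # read_csv parses the file and infers the column dtypes.
    df = pd.read_csv(path)
finally:
    os.remove(path)

print(df.dtypes)
```

The trade-off is an extra pass over the data on disk, and the types the generator produced are re-inferred by `read_csv` instead of being carried over.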

If you want to get super complicated, you can create a file-like object that returns the lines:

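def gen():
    # Stand-in for a generator that yields CSV-formatted lines.
    lines = [
        'col1,col2\n',
        'foo,bar\n',
        'foo,baz\n',
        'bar,baz\n'
    ]
    for line in lines:
        yield line

class Reader(object):
    """Minimal file-like wrapper: each read() call returns the next line."""
    def __init__(self, g):
        self.g = g
    def read(self, n=0):
        # n (the requested size) is ignored; one whole line is returned per call.
        try:
            return next(self.g)
        except StopIteration:
            # An empty string signals end of file to the parser.
            return ''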

And then use the read_csv:

>>> pd.read_csv(Reader(gen()))
  col1 col2
0  foo  bar
1  foo  baz
2  bar  baz
Viktor Kerkez
  • You are right, pandas 0.12 does not support generators. I've installed the dev version; the DataFrame constructor allows generators but DataFrame.from_records() does not. I've made a patch for it. – tinproject Sep 21 '13 at 14:37
  • @Viktor Kerkez : Quick question, if my generator function had list of lists in lines, but not consistently, say some objects could be lists-of-lists, and some could be simply lists,how would I gracefully change the "read" method, or should I handle it when I iterate over lines in gen() ? – ekta Nov 10 '14 at 05:23
  • @Viktor kerkez : very basic question, but here's what I mean. If I define lines = [ ['col1,col2\n'], ['foo,bar\n'], ['foo,baz\n'], ['bar,baz\n'] ], then keeping the rest same, I see that the Python shell restarts. I also tried instantiating then object for Reader class as r=Reader(gen()) df=pd.read_csv(r) . This suggests to me that there's something very basic about the class(Object) type notation, that I don't understand. My assumption is that I *should* be allowed to create lists if I wanted so, inside of a df "column", but not shell-restart. – ekta Nov 10 '14 at 05:43
  • @ekta The `read_csv` function can only parse *"pure"* CSV files, which can't contain lists. If you want lists in your data frame columns you'll have to use something else... Either parse JSON or do it manually. – Viktor Kerkez Nov 10 '14 at 11:56
  • @ViktorKerkez How does your Reader() solution affect performance? – member555 Nov 28 '15 at 20:51
7

To get it to be memory efficient, read in chunks. Something like this, using Viktor's Reader class from above.

df = pd.concat(list(pd.read_csv(Reader(gen()), chunksize=10000)), ignore_index=True)
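To see the chunked pattern in isolation, here is a small sketch that swaps the Reader class for an `io.StringIO` buffer (an assumption made only so the example runs on its own). With `chunksize` set, `read_csv` returns an iterator of DataFrames, each holding a slice of the rows, and the chunks are stacked row-wise.

```python
import io

import pandas as pd

# Stand-in for the real input; the thread uses Reader(gen()) here instead.
csv_text = "col1,col2\nfoo,bar\nfoo,baz\nbar,baz\n"

# With chunksize set, read_csv yields DataFrames of at most 2 rows each.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# Stack the chunks on top of each other and renumber the rows.
df = pd.concat(chunks, ignore_index=True)

print(df)
```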
Jeff
2

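You can also use something like this (tested in Python 2.7.5; in Python 3, the built-in zip replaces izip):

from itertools import izip
import pandas as pd

def dataframe_from_row_iterator(row_iterator, colnames):
    # Transpose the rows into columns, then pair each column with its name.
    col_iterator = izip(*row_iterator)
    return pd.DataFrame({cn: cv for (cn, cv) in izip(colnames, col_iterator)})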

You can also adapt this to append rows to a DataFrame.

-- Edit, Dec 4th: s/row/rows in last line

  • This has the same problem as presented in the question, it is infeasible to materialize the whole of the data as anything other than a dataframe or numpy array or some other packed form. Here you materialize it as a dict. – U2EF1 Nov 27 '13 at 21:55
  • Agreed, it does materialize the data as a dict. However, you don't have to materialize _all_ of it at once; just consume part of the generator, then append the data to a DataFrame in chunks. Just use itertools.islice to get the chunks from the generator/row_iterator. – Guilherme Freitas Dec 05 '13 at 00:06
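The chunked approach suggested in the last comment could be sketched as follows, in Python 3. The function name `frame_from_rows`, its `chunk_size` parameter, and the sample rows are made up for illustration. Only one chunk of Python tuples is alive at a time; each chunk is packed into a DataFrame before the next one is pulled from the generator.

```python
from itertools import islice

import pandas as pd

def frame_from_rows(row_iterator, colnames, chunk_size=100000):
    """Build a DataFrame from an iterator of row tuples, one chunk at a time."""
    row_iterator = iter(row_iterator)
    frames = []
    while True:
        # Pull at most chunk_size rows; this small list is the only
        # place where rows exist as plain Python tuples.
        chunk = list(islice(row_iterator, chunk_size))
        if not chunk:
            break
        frames.append(pd.DataFrame.from_records(chunk, columns=colnames))
    if not frames:
        # Empty input: return an empty frame with the expected columns.
        return pd.DataFrame(columns=colnames)
    return pd.concat(frames, ignore_index=True)

rows = ((i, i * 0.5) for i in range(10))  # hypothetical record generator
df = frame_from_rows(rows, ["n", "half"], chunk_size=4)
print(df.shape)
```

Note that the final `pd.concat` copies the data once more, so peak memory is roughly twice the packed size of the frame, which is usually still far less than a full list of tuples.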
2

If the generator yields DataFrames, much like a list of DataFrames, you just need to create a new DataFrame by concatenating its elements:

result = pd.concat(list_of_frames)

I recently faced the same problem.
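A small sketch of that case, assuming a hypothetical generator `frames()` that yields one DataFrame per batch. `pd.concat` accepts any iterable of DataFrames, so the generator can be passed in directly.

```python
import pandas as pd

def frames():
    """Hypothetical generator that yields one small DataFrame per batch."""
    for start in (0, 3, 6):
        yield pd.DataFrame({"n": range(start, start + 3)})

# concat consumes the generator and stacks the frames row-wise.
result = pd.concat(frames(), ignore_index=True)

print(result)
```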

balintbabics
0

According to ChatGPT, the following code should do the job. Tested with pandas 1.1.5 and Python 3.8.

import pandas as pd

def my_generator():
    yield {'Name': 'John', 'Age': 30, 'City': 'New York'}
    yield {'Name': 'Jane', 'Age': 25, 'City': 'Chicago'}
    yield {'Name': 'Mike', 'Age': 35, 'City': 'San Francisco'}

# Create the DataFrame from the generator (each yielded dict becomes one row)
df = pd.DataFrame(data=my_generator())

# Display the DataFrame
print(df)
shlomiLan