Create Pandas Dataframe from List of Generators

Question

I have the following question: is there a way to build a DataFrame from a list of Python generator objects? I used a list comprehension to create the list with data for the DataFrame:

data_list.append([record.Timestamp,record.Value, record.Name, record.desc] for record in records)

I did it this way because a normal list append in a for loop takes about 20x longer:

for record in records:
    data_list.append([record.Timestamp, record.Value, record.Name, record.desc])

I tried to create the dataframe but it doesn't work:

This:

dataframe = pd.DataFrame(data_list, columns=['timestamp', 'value', 'name', 'desc'])

Throws exception:

ValueError: 4 columns passed, passed data had 142538 columns.

I also tried to use itertools like this:

dataframe = pd.DataFrame(data=([list(elem) for elem in itt.chain.from_iterable(data_list)]), columns=['timestamp', 'value', 'name', 'desc'])

This results in an empty DataFrame:

Empty DataFrame
Columns: [timestamp, value, name, desc]
Index: []

data_list looks like this:

[<generator object St...51DB0>, <generator object St...56EB8>,<generator object St...51F10>, <generator object St...51F68>]

Code for generating the list looks like this:

for events in events_list:
    for record in events:
        data_list.append([record.Timestamp,record.Value, record.Name, record.desc] for record in records)

This is required because of events list data structure.

Is there a way for me to create a DataFrame out of a list of generators? If there is, will it be time efficient? I save a lot of time by replacing the normal for loop with a list comprehension, but if creating the DataFrame then takes more time, the change is pointless.

sim · Accepted Answer · 2020-03-02T14:57:45.023

Just turn your data_list into a generator expression as well. For example:

from collections import namedtuple

import pandas as pd

MyData = namedtuple("MyData", ["a"])
data = (d.a for d in (MyData(i) for i in range(100)))
df = pd.DataFrame(data)

will work just fine. So what you should do is have:

data = ((record.Timestamp,record.Value, record.Name, record.desc) for record in records)
df = pd.DataFrame(data, columns=["Timestamp", "Value", "Name", "Desc"])

Your approach fails because data_list contains a single entry: a generator over (I suppose) 142538 records. Pandas tries to fit that single entry into a single row, so all 142538 records, each a list of four elements, become the columns of that row. This fails because only 4 column names were passed.
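To make this concrete, here is a minimal sketch using made-up `Record` objects (the namedtuple and its values are assumptions standing in for the real data). Each `append` of a generator expression adds exactly one element to `data_list`, however many records the generator will later yield. If the list of generators has to be kept, `itertools.chain.from_iterable` can flatten it into rows, provided the generators have not been consumed yet:

```python
from collections import namedtuple
from itertools import chain

import pandas as pd

# Hypothetical stand-in for the real record objects.
Record = namedtuple("Record", ["Timestamp", "Value", "Name", "desc"])
records = [Record(i, i * 10, f"name{i}", f"desc{i}") for i in range(3)]

data_list = []
# Appending a generator expression adds ONE element: the generator itself.
data_list.append([r.Timestamp, r.Value, r.Name, r.desc] for r in records)
n_entries = len(data_list)  # one entry, regardless of how many records exist

# Flatten the list of generators into a single stream of 4-element rows.
rows = chain.from_iterable(data_list)
df = pd.DataFrame(rows, columns=["timestamp", "value", "name", "desc"])
```

This keeps the list of generators but still hands pandas one row per record, which is what the four column names require.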

Edit: you can of course make the generator expression more complex, here's an example along the lines of your additional loop over events:

from collections import namedtuple
MyData = namedtuple("MyData", ["a", "b"])
data = ((d.a, d.b) for j in range(100) for d in (MyData(j, j+i) for i in range(100)))
pd.DataFrame(data, columns=["a", "b"])

edit: here's also an example using data structures like you are using:

Record = namedtuple("Record", ["Timestamp", "Value", "Name", "desc"])

event_list = [[Record(Timestamp=1, Value=1, Name=1, desc=1),
               Record(Timestamp=2, Value=2, Name=2, desc=2)],
              [Record(Timestamp=3, Value=3, Name=3, desc=3)]]

data = ((r.Timestamp, r.Value, r.Name, r.desc) for events in event_list for r in events)
pd.DataFrame(data, columns=["timestamp", "value", "name", "desc"])

Output:

   timestamp  value  name  desc
0          1      1     1     1
1          2      2     2     2
2          3      3     3     3
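If the data has one more level of nesting, the same pattern extends by adding another `for` clause. The sketch below assumes a hypothetical outer list called `sources`, where each source holds a list of event lists; the real structure may differ:

```python
from collections import namedtuple

import pandas as pd

Record = namedtuple("Record", ["Timestamp", "Value", "Name", "desc"])

# Hypothetical three-level structure: sources -> event lists -> records.
sources = [
    [[Record(1, 1, "a", "x"), Record(2, 2, "b", "y")]],
    [[Record(3, 3, "c", "z")], [Record(4, 4, "d", "w")]],
]

# The for clauses are read left to right, outermost loop first.
data = (
    (r.Timestamp, r.Value, r.Name, r.desc)
    for event_list in sources
    for events in event_list
    for r in events
)
df = pd.DataFrame(data, columns=["timestamp", "value", "name", "desc"])
```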

I have a list of generators. So I don't think data = ((record.Timestamp,record.Value, record.Name, record.desc) for record in records) will work. The list comprehension is in another for loop. — Noonewins, Mar 02 '20 at 14:42
@Noonewins: My approach should get you the same as if you were creating the data_list using the for-loop in your second code block. See also the explanation why your first code block will not work as expected - hope that makes sense. — sim, Mar 02 '20 at 14:46
@Noonewins: I have added a more complex generator expression example that incorporates your loop over the event_list. — sim, Mar 02 '20 at 15:02
Okay, I really like your answer and it works pretty well until I have to create the dataframe. data = ((r.Timestamp, r.Value, r.Name, r.desc) for events in event_list for r in events) works great, although in my case there is one more loop to add. When I check the generator for data with [list(x) for x in data], I see the data. However, once I add it to the dataframe, it says that the dataframe is empty. I am now trying to figure it out. — Noonewins, Mar 02 '20 at 15:20
@Noonewins: Once you have exhausted the generator (e.g. by checking the generator for data as you described), you cannot iterate over it again. That's probably the reason why it says that the dataframe is empty. — sim, Mar 02 '20 at 15:25
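The exhaustion effect described in the last comment can be seen with plain Python, without pandas. This is a small sketch with made-up values; a DataFrame constructor given the spent generator would receive nothing in the same way:

```python
# A generator expression can be iterated only once.
rows = ((i, i * 2) for i in range(3))

first_pass = list(rows)   # consumes the generator
second_pass = list(rows)  # nothing left to yield

# To inspect the data and still use it afterwards, keep the materialized
# list (first_pass) and build from that instead of the spent generator.
```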

score 0 · Answer 2 · answered Mar 02 '20 at 14:51


pd.concat(some_generator_yielding_dfs) will work (this is actually one of the tricks for reducing the load when handling big tables). For example:

pd.concat((pd.read_csv(x) for x in files))
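A self-contained variant of the same idea, using hypothetical in-memory chunks in place of CSV files so it can be run without any data on disk:

```python
import pandas as pd

# Hypothetical in-memory chunks standing in for the CSV files.
chunks = [
    [(1, "a"), (2, "b")],
    [(3, "c")],
]

# A generator that yields one small DataFrame per chunk.
frames = (pd.DataFrame(chunk, columns=["value", "name"]) for chunk in chunks)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating
# each small frame's own index.
df = pd.concat(frames, ignore_index=True)
```

Note that this builds an intermediate DataFrame per chunk before combining them, which is the extra copying the comments below refer to.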


Oleg O


but with this you add additional copying overhead – sim Mar 02 '20 at 14:53
@Oleg O, I did try to concat dataframes, however because of the size it actually took more time and more memory. I think when you concat a dataframe it creates a copy of the dataframe in memory. – Noonewins Mar 02 '20 at 14:55
@sim what do you mean? – Oleg O Mar 02 '20 at 14:55
@Noonewins right, it creates a copy of one table that it appends at the moment, but then it's destroyed upon the end of the current concat, since it has no identifier assigned to it. – Oleg O Mar 02 '20 at 14:57

CypherX · Answer 3 · 2020-03-02T14:59:41.240

Solution

Make a dict with the columns you need as shown below.
Feed the dict to pandas.DataFrame.

Note: Each column is built as a full list, so all the data is materialized in memory. records must be re-iterable (for example a list), because it is traversed once per column.

import pandas as pd

# records: the iterable of record objects from the question

# Method-1: create a dict by direct declaration
d = {
    'timestamp': [record.Timestamp for record in records],
    'value': [record.Value for record in records],
    'name': [record.Name for record in records],
    'desc': [record.desc for record in records],
}

# Method-2: create the same dict using a dict comprehension
keys = ['Timestamp', 'Value', 'Name', 'desc']
d = {key.lower(): [getattr(record, key) for record in records] for key in keys}

# Finally create the dataframe from the dictionary (keys become columns)
dataframe = pd.DataFrame(d)

See Also:

Is there any shorthand for 'yield all the output from a generator'?

@Noonewins Please let me know if this worked for you. — CypherX, Mar 02 '20 at 15:04
