
I am pulling a large amount of data. It comes as a list of lists of objects.

Example: [[objectA, objectB],[objectC],[],[ObjectD]...]

Each object has a lot of attributes, but for my dataframe I need only name, value, timestamp, and description. I tried two things:

import pandas as pd

final_df = pd.DataFrame()
for events in events_list:
    if len(events) > 0:
        for event in events:
            # build a one-row frame per event and append it
            df = pd.DataFrame({'timestamp': [event.timestamp],
                               'value': [event.value],
                               'name': [event.name],
                               'desc': [event.desc]})
            final_df = final_df.append(df)

This takes around 15 minutes to complete.

I changed the code to use a Python list:

df_list = []
for events in events_list:
    if len(events) > 0:
        for event in events:
            df_list.append([event.timestamp, event.value, event.name, event.desc])
final_df = pd.DataFrame(df_list, columns=['timestamp', 'value', 'name', 'desc'])

With this change I managed to reduce the time to approximately 10-11 minutes.

I am still researching whether there is a faster way. Before switching to a Python list I tried a dictionary, but it was much slower than I expected. Currently I am reading about pandas vectorization, which seems really fast, but I am not sure whether I can use it for my purpose. I know that Python loops are slow and there is not much I can do about them, so I am also trying to figure out a way to push those loops into the dataframe.

My question is: has anyone tackled this problem before, and is there a better way to do it?

EDIT: There have been questions about the data. It comes through an API, and it is structured this way because every group of objects is grouped by name. For example:

[[objectA, objectB (both have the same name)],[objectC],[EMPTY - There is no data for this name],[ObjectD]...]

Because I cannot change the way I get the data, I have to work with this data structure.

Noonewins

3 Answers


The computationally heavy operation in your initial method is append: each time you call final_df.append(df) you create an entirely new (and larger with every iteration!) dataframe. Instead, aggregate all the dataframes into a list and use pd.concat(df_list).
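
For example, a minimal sketch of that pattern, reusing events_list and the attribute names from the question:

rows = []
for events in events_list:
    for event in events:
        # one-row frame per event, as in the original approach
        rows.append(pd.DataFrame({'timestamp': [event.timestamp],
                                  'value': [event.value],
                                  'name': [event.name],
                                  'desc': [event.desc]}))

# a single concat at the end is linear in the total size, whereas
# repeated .append() recopies the accumulated frame on every iteration
final_df = pd.concat(rows, ignore_index=True)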

To go faster than that, you may want to use multiprocessing to some extent, either through the standard Python multiprocessing library or through a framework; I recommend Dask.
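
A rough sketch of the standard-library route, assuming the event objects are picklable and events_list is the structure from the question (Dask's API differs, but the chunking idea is the same):

import multiprocessing as mp

import pandas as pd


def extract(group):
    # flatten one group into plain tuples so only small objects cross process boundaries
    return [(e.timestamp, e.value, e.name, e.desc) for e in group]


if __name__ == '__main__':
    # events_list comes from the API, as described in the question
    with mp.Pool() as pool:
        chunks = pool.map(extract, events_list, chunksize=64)
    final_df = pd.DataFrame((row for chunk in chunks for row in chunk),
                            columns=['timestamp', 'value', 'name', 'desc'])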

Edit: P.S. If your data originally comes from a csv/excel/parquet file or another format supported by pandas, you can use pandas to load all the data at once very efficiently. Even if your events include unnecessary columns, it will be much faster to load the entire data set and then filter out the redundant columns.
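
For instance, if the source were a CSV ('events.csv' is a placeholder path), the needed columns can even be selected at load time:

import pandas as pd

# usecols limits parsing to the needed columns
final_df = pd.read_csv('events.csv',
                       usecols=['timestamp', 'value', 'name', 'desc'],
                       parse_dates=['timestamp'])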

Alon Gadot
  • I tried pd.concat; it did not speed things up, the only plus was that it lowered memory usage. However, memory is not a problem in this case. So far, building a list of lists and giving it to pd.DataFrame is the fastest way. I am considering using multiprocessing for the list creation, but I still hope there is a better way to do it with, for example, pandas or numpy. – Noonewins Jan 17 '20 at 17:49
  • How do you query your database? Are you using an ORM? Pandas does have a pd.read_sql method. – Alon Gadot Jan 17 '20 at 17:50
  • Yes, we use an ORM, but I do not have direct access to the database, because of security reasons I guess. We use an API that returns this list of lists of objects based on a query. – Noonewins Jan 17 '20 at 17:55

How about something like this?

import datetime
import itertools as itt
import operator
import random
from dataclasses import dataclass

import pandas as pd


# DUMMY DATA SETUP

@dataclass
class Obj:
    name: str
    timestamp: datetime.datetime
    value: int
    desc: str


group_lens = [random.randint(0, 1000) for _ in range(200000)]
event_count = 1

events = []
for curr_group_len in group_lens:
    curr_group = []
    for _ in range(curr_group_len):
        curr_group.append(Obj(f"event_{event_count}",
                              datetime.datetime.now(),
                              random.randint(-100, 100),
                              f"event_{event_count} desc"))
        event_count += 1
    events.append(curr_group)

# DATAFRAME CREATION

# pull only the four needed attributes from each object, in one pass
cust_getter = operator.attrgetter('name', 'timestamp', 'value', 'desc')

df = pd.DataFrame(data=(cust_getter(elem) for elem in itt.chain.from_iterable(events)),
                  columns=['name', 'timestamp', 'value', 'desc'])

I tested it on a 2-dimensional list of 10,006,766 elements, and it only took 9 seconds.

AMC
  • 605.20 seconds for the loop; 6.15 seconds for df creation; 12,412,942 events. I guess it's the object then. I will have to experiment with extracting the data. – Noonewins Jan 17 '20 at 19:54
  • @Noonewins _605.20 seconds for loop;_ Which loop? I meant benchmark with your data, using the DataFrame creation method I shared. – AMC Jan 17 '20 at 20:01
  • .\test.py > 10,007,089 elements 166.06090259552002 seconds - This is the result when I run your script with the object that I am using. – Noonewins Jan 17 '20 at 20:30
  • Yes, the results are promising :) I am now trying to implement it in my code to see the final result. Thank you. – Noonewins Jan 17 '20 at 20:33
  • @Noonewins Can you share the "final"/complete benchmark results when they're done? It might be worth profiling it again, by the way. – AMC Jan 17 '20 at 20:34
  • I am marking this post as the answer. It sped up the for loop significantly: 231.80 seconds for the loop, 5.94 seconds for df creation, 12,412,942 events. – Noonewins Jan 20 '20 at 14:03
  • @Noonewins Do you want to share the implementation you came up with, or are you fine with the performance at this point? – AMC Jan 20 '20 at 18:23
  • It runs almost 3 times faster now, so I am happy with that. The implementation is exactly as in your post, with a change to how I append the objects' data. – Noonewins Jan 20 '20 at 19:44
  • @Noonewins What appending? The first part of the code I posted is for my purposes only, it’s meant to reproduce what you already have. – AMC Jan 20 '20 at 19:48
  • Maybe I should give a bit more information. First I go through the list in a for loop as usual. The difference is that when I build my list of values I do it with a generator expression: data_list.append((x.name, x.timestamp, x.value, x.desc) for x in events). Then when I create the dataframe I do it this way: df = pd.DataFrame(data=[list(elem) for elem in itt.chain.from_iterable(data_list)]). This reduced the time. I hope this is helpful to other people. – Noonewins Jan 20 '20 at 20:14
  • @Noonewins I’m confused, looking at the first snippet of code in your post, isn’t that what you were already doing? – AMC Jan 20 '20 at 22:14
  • @Noonewins Did you ever straighten things out? – AMC Jan 23 '20 at 01:00
  • Sorry for not responding sooner, I am super busy with this project. As it turns out, when doing list comprehension and appending a list of generators to a list, it's way faster than if we extract those values and build a list of lists with values. However, the problem is now the creation of the dataframe. I am looking for ways to unpack the list of generators so that I can create the dataframe. For now I had to put this performance issue aside, but when I have time I will tackle it again. – Noonewins Jan 23 '20 at 16:02
  • @Noonewins _As it turns out, when doing list comprehension and appending a list of generators to a list, it's way faster than if we extract those values and build a list of lists with values._ I'm not sure I understand exactly what you mean. What do you mean by _unpack the list of generators so that I can create the dataframe_? – AMC Jan 23 '20 at 18:08
  • When I append to the list this way: data_list.append((x.name, x.timestamp, x.value, x.desc) for x in events), I get [[,..]..]. The generator yields one item at a time and generates an item only on demand, whereas in a list comprehension Python reserves memory for the whole list. The problem comes when I try to create the dataframe from this data structure ([[generator, generator..],..]). I tried zip and itt.chain.from_iterable, but it takes time. This is what I have so far. – Noonewins Jan 23 '20 at 19:16
  • @Noonewins Why do you need to build the list like that in the first place though? – AMC Jan 23 '20 at 19:36
  • Building it with generators reduced the extraction of the data from 300 sec to 5 sec. This way of building the list is a lot faster. The problem comes when I need to consume the data. – Noonewins Jan 23 '20 at 19:53
  • @Noonewins _This way of building the list is a lot faster._ Which way? Is it faster than what I suggested? – AMC Jan 23 '20 at 20:29

I found an answer to my question using generators. Here is a link to another thread that was created specifically to figure out how to build a dataframe from a list of Python generators; in it we worked out a way to solve the problem from this thread: Create Pandas Dataframe from List of Generators

To summarize it, I replaced this:

df_list = []
for events in events_list:
    if len(events) > 0:
        for event in events:
            for record in event:
                df_list.append([record.timestamp, record.value, record.name, record.desc])
final_df = pd.DataFrame(df_list, columns=['timestamp', 'value', 'name', 'desc'])

With this:

data = ((record.timestamp, record.value, record.name, record.desc)
        for events in events_list for event in events for record in event)

dataframe = pd.DataFrame(data, columns=["timestamp", "value", "name", "desc"])

Using a generator expression, I save a lot of time by building the rows in one pass instead of doing continuous appends.

Test with 15 million records (including creation of the DF):

list append with for loop = 16 minutes

generator expression = 3 minutes

I will continue to test this over the next couple of days with different amounts of data.

Noonewins