
I am pulling a large amount of data. It comes as a list of lists of objects.

Example: [[objectA, objectB],[objectC],[],[ObjectD]...]

Each object has a lot of attributes, but for my dataframe I need only name, value, timestamp, and description. I tried two things:

import pandas as pd

final_df = pd.DataFrame()
for events in events_list:
    if len(events) > 0:
        for event in events:
            # build a one-row frame per event and append it
            df = pd.DataFrame({'timestamp': [event.timestamp],
                               'value': [event.value],
                               'name': [event.name],
                               'desc': [event.desc]})
            final_df = final_df.append(df)

This takes around 15 minutes to complete.

I changed the code to use a Python list:

df_list = []
for events in events_list:
    if len(events) > 0:
        for event in events:
            df_list.append([event.timestamp, event.value, event.name, event.desc])
final_df = pd.DataFrame(df_list, columns=['timestamp', 'value', 'name', 'desc'])

With this change I managed to reduce the time to approximately 10-11 minutes.

I am still researching whether there is a faster way. Before switching to a Python list I tried a dictionary, but it was much slower than I expected. Currently I am reading about pandas vectorization, which seems really fast, but I am not sure whether I can use it for my purpose. I know that Python loops are slow and there is not much I can do about them, so I am also trying to figure out a way to push those loops into the dataframe.

My question is: has anyone tackled this problem before, and is there a better way to do it?

EDIT: There have been questions about the data. It comes through an API, and it is structured this way because every group of objects is grouped by name. For example:

[[objectA, objectB (both have the same name)],[objectC],[EMPTY - There is no data for this name],[ObjectD]...]

Because I cannot change the way I get the data, I have to work with this data structure.

Noonewins

3 Answers


The computationally heavy operation in your initial method is append: each time you call final_df.append(df) you create an entirely new (and larger with every iteration!) dataframe. Instead, aggregate all the dataframes into a list and use pd.concat(df_list).
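
For example, a minimal sketch of that pattern, reusing events_list and the attribute names from the question:

rows = []
for events in events_list:
    for event in events:
        # one-row frame per event, as in the original approach
        rows.append(pd.DataFrame({'timestamp': [event.timestamp],
                                  'value': [event.value],
                                  'name': [event.name],
                                  'desc': [event.desc]}))

# a single concat at the end is linear in the total size, whereas
# repeated .append() recopies the accumulated frame on every iteration
final_df = pd.concat(rows, ignore_index=True)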

To go faster than that, you may want to use multiprocessing to some extent, either through the standard Python multiprocessing library or through a framework; I recommend Dask.
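
A rough sketch of the standard-library route, assuming the event objects are picklable and events_list is the structure from the question (Dask's API differs, but the chunking idea is the same):

import multiprocessing as mp

import pandas as pd


def extract(group):
    # flatten one group into plain tuples so only small objects cross process boundaries
    return [(e.timestamp, e.value, e.name, e.desc) for e in group]


if __name__ == '__main__':
    # events_list comes from the API, as described in the question
    with mp.Pool() as pool:
        chunks = pool.map(extract, events_list, chunksize=64)
    final_df = pd.DataFrame((row for chunk in chunks for row in chunk),
                            columns=['timestamp', 'value', 'name', 'desc'])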

Edit: P.S. If your data originally comes from a csv/excel/parquet file or another format supported by pandas, you can use pandas to load all the data at once very efficiently. Even if your events include unnecessary columns, it will be much faster to load the entire data set and then filter out the redundant columns.
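
For instance, if the source were a CSV ('events.csv' is a placeholder path), the needed columns can even be selected at load time:

import pandas as pd

# usecols limits parsing to the needed columns
final_df = pd.read_csv('events.csv',
                       usecols=['timestamp', 'value', 'name', 'desc'],
                       parse_dates=['timestamp'])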

Alon Gadot
  • I tried pd.concat; it did not speed things up, the only plus was that it lowered memory usage. However, memory is not a problem in this case. So far, building a list of lists and giving it to pd.DataFrame is the fastest way. I am considering using multiprocessing for the list creation, but I still hope there is a better way to do it with, for example, pandas or numpy. – Noonewins Jan 17 '20 at 17:49
  • How do you query your database? Are you using an ORM? Pandas does have a pd.read_sql method. – Alon Gadot Jan 17 '20 at 17:50
  • Yes, we use an ORM, but I do not have direct access to the database, because of security reasons I guess. We use an API that returns this list of lists of objects based on a query. – Noonewins Jan 17 '20 at 17:55

How about something like this?

import datetime
import itertools as itt
import operator
import random
from dataclasses import dataclass

import pandas as pd


# DUMMY DATA SETUP

@dataclass
class Obj:
    name: str
    timestamp: datetime.datetime
    value: int
    desc: str


group_lens = [random.randint(0, 1000) for _ in range(200000)]
event_count = 1

events = []
for curr_group_len in group_lens:
    curr_group = []
    for _ in range(curr_group_len):
        curr_group.append(Obj(f"event_{event_count}",
                              datetime.datetime.now(),
                              random.randint(-100, 100),
                              f"event_{event_count} desc"))
        event_count += 1
    events.append(curr_group)

# DATAFRAME CREATION

# pull only the four needed attributes from each object, in one pass
cust_getter = operator.attrgetter('name', 'timestamp', 'value', 'desc')

df = pd.DataFrame(data=(cust_getter(elem) for elem in itt.chain.from_iterable(events)),
                  columns=['name', 'timestamp', 'value', 'desc'])

I tested it on a 2-dimensional list of 10,006,766 elements, and it only took 9 seconds.

AMC
  • 605.20 seconds for the loop; 6.15 seconds for df creation; 12,412,942 events. I guess it's the object then. I will have to experiment with extracting the data. – Noonewins Jan 17 '20 at 19:54
  • @Noonewins _605.20 seconds for loop;_ Which loop? I meant benchmark with your data, using the DataFrame creation method I shared. – AMC Jan 17 '20 at 20:01
  • .\test.py > 10,007,089 elements 166.06090259552002 seconds - This is the result when I run your script with the object that I am using. – Noonewins Jan 17 '20 at 20:30
  • Yes, the results are promising :) I am now trying to implement it in my code to see the final result. Thank you. – Noonewins Jan 17 '20 at 20:33
  • @Noonewins Can you share the "final"/complete benchmark results when they're done? It might be worth profiling it again, by the way. – AMC Jan 17 '20 at 20:34
  • I am marking this post as the answer. It sped up the for loop significantly: 231.80 seconds for the loop, 5.94 seconds for df creation, 12,412,942 events. – Noonewins Jan 20 '20 at 14:03
  • @Noonewins Do you want to share the implementation you came up with, or are you fine with the performance at this point? – AMC Jan 20 '20 at 18:23
  • It runs almost 3 times faster now, so I am happy with that. The implementation is exactly as in your post, with a change to how I append the objects' data. – Noonewins Jan 20 '20 at 19:44
  • @Noonewins What appending? The first part of the code I posted is for my purposes only, it’s meant to reproduce what you already have. – AMC Jan 20 '20 at 19:48
  • Maybe I should give a bit more information. First I go through the list in a for loop as usual. The difference is that when I build my list of values I do it with a generator expression: data_list.append((x.name, x.timestamp, x.value, x.desc) for x in events). Then when I create the dataframe I do it this way: df = pd.DataFrame(data=[list(elem) for elem in itt.chain.from_iterable(data_list)]). This reduced the time. I hope this is helpful to other people. – Noonewins Jan 20 '20 at 20:14
  • @Noonewins I’m confused, looking at the first snippet of code in your post, isn’t that what you were already doing? – AMC Jan 20 '20 at 22:14
  • @Noonewins Did you ever straighten things out? – AMC Jan 23 '20 at 01:00
  • Sorry for not responding sooner, I am super busy with this project. As it turns out, when doing list comprehension and appending a list of generators to a list, it's way faster than if we extract those values and build a list of lists with values. However, the problem is now the creation of the dataframe. I am looking for ways to unpack the list of generators so that I can create the dataframe. For now I had to put this performance issue aside, but when I have time I will tackle it again. – Noonewins Jan 23 '20 at 16:02
  • @Noonewins _As it turns out, when doing list comprehension and appending a list of generators to a list, it's way faster than if we extract those values and build a list of lists with values._ I'm not sure I understand exactly what you mean. What do you mean by _unpack the list of generators so that I can create the dataframe_? – AMC Jan 23 '20 at 18:08
  • When I append to the list this way: data_list.append((x.name, x.timestamp, x.value, x.desc) for x in events), I get [[,..]..]. The generator yields one item at a time and generates an item only on demand, whereas in a list comprehension Python reserves memory for the whole list. The problem comes when I try to create the dataframe from this data structure ([[generator, generator..],..]). I tried zip and itt.chain.from_iterable, but it takes time. This is what I have so far. – Noonewins Jan 23 '20 at 19:16
  • @Noonewins Why do you need to build the list like that in the first place though? – AMC Jan 23 '20 at 19:36
  • Building it with generators reduced the extraction of the data from 300 sec to 5 sec. This way of building the list is a lot faster. The problem comes when I need to consume the data. – Noonewins Jan 23 '20 at 19:53
  • @Noonewins _This way of building the list is a lot faster._ Which way? Is it faster than what I suggested? – AMC Jan 23 '20 at 20:29

I found an answer to my question using generators. Here is a link to another thread that was created specifically to figure out how to build a dataframe from a list of Python generators; in it we worked out a way to solve the problem from this thread: Create Pandas Dataframe from List of Generators

To summarize it, I replaced this:

df_list = []
for events in events_list:
    if len(events) > 0:
        for event in events:
            for record in event:
                df_list.append([record.timestamp, record.value, record.name, record.desc])
final_df = pd.DataFrame(df_list, columns=['timestamp', 'value', 'name', 'desc'])

With this:

data = ((record.timestamp, record.value, record.name, record.desc)
        for events in events_list for event in events for record in event)

dataframe = pd.DataFrame(data, columns=["timestamp", "value", "name", "desc"])

Using a generator expression, I save a lot of time by building the rows in one pass instead of doing continuous appends.

Test with 15 million records (including creation of the DF):

list append with for loop = 16 minutes

generator expression = 3 minutes

I will continue to test this over the next couple of days with different amounts of data.

Noonewins