1

I have a program with a nested structure that is currently written using the obvious approach for appending a list-of-lists to the top-level DataFrame (by creating a DataFrame from that list-of-lists and then appending that to the target DataFrame):

import pandas as pd

columns=["inner", "outer", "col1", "col2", "col3", "col4"]
def create_children(inner, outer):
    results = []
    for i in range(inner):
        results.append([f'{i}', f'{outer}', 'a', 'b', 'c', 'd'])

    return results

def test(outer, inner):
    df = pd.DataFrame(columns=columns)
    for i in range(outer):
        children = create_children(inner, i)
        child_df = pd.DataFrame(children, columns=columns)
        df = pd.concat([df, child_df]) # Faster than append

    return df

The problem is that when I profile this, the creation of the child DataFrame is taking a serious amount of time:

Timer unit: 1e-06 s

Total time: 0.012352 s
File: <ipython-input-43-d816d566eb1b>
Function: test at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def test(outer, inner):
     2         1       5542.0   5542.0     44.9      df = pd.DataFrame(columns=columns)
     3         3          5.0      1.7      0.0      for i in range(outer):
     4         2         10.0      5.0      0.1          children = create_children(inner, i)
     5         2       4341.0   2170.5     35.1          child_df = pd.DataFrame(children, columns=columns)
     6         2       2454.0   1227.0     19.9          df = pd.concat([df, child_df])
     7                                           # Works in this case but problems with an index and slightly slower
     8                                           #         df = df.append(child_df)
     9                                           
    10         1          0.0      0.0      0.0      return df

If I rewrite this simple example to only create the DataFrame at the end, then it is significantly faster:

def test2(outer, inner):
    all_children = []
    for i in range(outer):
        children = create_children(inner, i)
        all_children.extend(children)

    df = pd.DataFrame(all_children, columns=columns)

    return df

Giving:

Timer unit: 1e-06 s

Total time: 0.002104 s
File: <ipython-input-44-05d8d95dfe60>
Function: test2 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def test2(outer, inner):
     2         1          1.0      1.0      0.0      all_children = []
     3         3          4.0      1.3      0.2      for i in range(outer):
     4         2          8.0      4.0      0.4          children = create_children(inner, i)
     5         2          2.0      1.0      0.1          all_children.extend(children)
     6                                           
     7         1       2088.0   2088.0     99.2      df = pd.DataFrame(all_children, columns=columns)
     8                                                   
     9         1          1.0      1.0      0.0      return df

Unfortunately, the program in question makes use of DataFrame features in the outer loop, so I can't simply eliminate the use of the DataFrame. (My ultimate goal is to do this, but it is a fair bit of refactoring.)

My question is: Is there a way to append a conforming list-of-lists to a DataFrame without creating the intermediate DataFrame, which seems to entail a lot of overhead?

sfjac
  • 7,119
  • 5
  • 45
  • 69
  • 1
    Read this: [Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Apr 19 '20 at 00:20

1 Answers1

1

I would suggest saving the intermediate data in a dictionary and appending that dictionary to the list. In the end you can simply create your final DataFrame:

columns=["inner", "outer", "col1", "col2", "col3", "col4"]
def create_children(inner, outer):
results = []
for i in range(inner):
    dct = {'inner': some_value,
           ....,
          {'col4':  another_value,
    results.append(dct)
return results

def test(outer, inner):
all_results = []
for i in range(outer):
    children = create_children(inner, i)
    all_results.extend(children)

df = pd.DataFrame(all_results, columns=columns)
return df
Ralvi Isufaj
  • 442
  • 2
  • 9
  • Agree that waiting to assemble the DataFrame until the end is optimal - I did it with lists of lists in the post because that's the way the data is computed, but in many other applications I've done it like this. Just trying to figure out if there is a way to more-efficiently do it piecemeal since I currently need DataFrame functionality during the outer loop. – sfjac Apr 19 '20 at 15:52
  • 1
    I see. Well, creating dataframes is an expensive operation. Depending on your usecase, perhaps you should have a look at how can you achieve what you want with numpy arrays, they are much faster than working with dataframes. – Ralvi Isufaj Apr 19 '20 at 16:12
  • That's the long term goal. I'm just surprised at the overhead of appending when the columns are matching and already known. Was hoping there was something non-obvious that I was missing. – sfjac Apr 19 '20 at 21:12