
I am confused by the performance in Pandas when building a large dataframe chunk by chunk. In Numpy, we (almost) always see better performance by preallocating a large empty array and then filling in the values. As I understand it, this is due to Numpy grabbing all the memory it needs at once instead of having to reallocate memory with every append operation.
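For reference, here is the kind of NumPy pattern I have in mind (a minimal sketch; the sizes are arbitrary):

import numpy as np

n_chunks, chunk_rows, n_cols = 10, 2500, 40

# Preallocate the full array once, then fill it in slice by slice
out = np.empty((n_chunks * chunk_rows, n_cols))
for i in range(n_chunks):
    out[i * chunk_rows:(i + 1) * chunk_rows, :] = np.random.rand(chunk_rows, n_cols)

# versus growing the array by repeated concatenation,
# which reallocates and copies the whole thing on every pass
grown = np.empty((0, n_cols))
for i in range(n_chunks):
    grown = np.concatenate([grown, np.random.rand(chunk_rows, n_cols)])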

In Pandas, I seem to be getting better performance by using the df = df.append(temp) pattern.

Here is an example with timing; the definition of the Timer class is given further down. As you can see, I find that preallocating is roughly 10x slower than using append! Preallocating a dataframe with np.empty values of the appropriate dtype helps a great deal, but the append method is still the fastest.

import numpy as np
from numpy.random import rand
import pandas as pd

from timer import Timer

# Some constants
num_dfs = 10  # Number of random dataframes to generate
n_rows = 2500
n_cols = 40
n_reps = 100  # Number of repetitions for timing

# Generate a list of num_dfs dataframes of random values
df_list = [pd.DataFrame(rand(n_rows*n_cols).reshape((n_rows, n_cols)), columns=np.arange(n_cols)) for i in np.arange(num_dfs)]

##
# Define two methods of growing a large dataframe
##

# Method 1 - append dataframes
def method1():
    out_df1 = pd.DataFrame(columns=np.arange(n_cols))
    for df in df_list:
        out_df1 = out_df1.append(df, ignore_index=True)
    return out_df1

def method2():
    # Create an empty dataframe that is big enough to hold all the dataframes in df_list
    out_df2 = pd.DataFrame(columns=np.arange(n_cols), index=np.arange(num_dfs*n_rows))
    # EDIT_1: Set the dtype of each column to match the input dataframes
    for ix, col in enumerate(out_df2.columns):
        out_df2[col] = out_df2[col].astype(df_list[0].dtypes[ix])
    # Fill in the values
    for ix, df in enumerate(df_list):
        out_df2.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df2

# EDIT_2: Method 3 - preallocate dataframe with np.empty data of the appropriate dtypes
def method3():
    # Create fake data array
    data = np.transpose(np.array([np.empty(n_rows*num_dfs, dtype=dt) for dt in df_list[0].dtypes]))
    # Create placeholder dataframe
    out_df3 = pd.DataFrame(data)
    # Fill in the real values
    for ix, df in enumerate(df_list):
        out_df3.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df3

##
# Time both methods
##

# Time Method 1
times_1 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
       df1 = method1()
    times_1[i] = t.secs
print 'Total time for %d repetitions of Method 1: %f [sec]' % (n_reps, np.sum(times_1))
print 'Best time: %f' % (np.min(times_1))
print 'Mean time: %f' % (np.mean(times_1))

#>>  Total time for 100 repetitions of Method 1: 2.928296 [sec]
#>>  Best time: 0.028532
#>>  Mean time: 0.029283

# Time Method 2
times_2 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df2 = method2()
    times_2[i] = t.secs
print 'Total time for %d repetitions of Method 2: %f [sec]' % (n_reps, np.sum(times_2))
print 'Best time: %f' % (np.min(times_2))
print 'Mean time: %f' % (np.mean(times_2))

#>>  Total time for 100 repetitions of Method 2: 32.143247 [sec]
#>>  Best time: 0.315075
#>>  Mean time: 0.321432

# Time Method 3
times_3 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df3 = method3()
    times_3[i] = t.secs
print 'Total time for %d repetitions of Method 3: %f [sec]' % (n_reps, np.sum(times_3))
print 'Best time: %f' % (np.min(times_3))
print 'Mean time: %f' % (np.mean(times_3))

#>>  Total time for 100 repetitions of Method 3: 6.577038 [sec]
#>>  Best time: 0.063437
#>>  Mean time: 0.065770

I use a nice Timer courtesy of Huy Nguyen:

# credit: http://www.huyng.com/posts/python-performance-analysis/

import time

class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        self.start = time.clock()
        return self

    def __exit__(self, *args):
        self.end = time.clock()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print 'elapsed time: %f ms' % self.msecs

If you are still following, I have two questions:

1) Why is the append method faster? (NOTE: for very small dataframes, e.g. n_rows = 40, it is actually slower.)

2) What is the most efficient way to build a large dataframe out of chunks? (In my case, the chunks are all large csv files).

Thanks for your help!

EDIT_1: In my real-world project, the columns have different dtypes, so I cannot use the pd.DataFrame(..., dtype=some_type) trick to improve the performance of preallocation, per BrenBarn's recommendation. The dtype parameter forces all the columns to be the same dtype [ref. pandas issue 4464].
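For example, a quick illustration of that single-dtype behaviour (the column names and values here are made up):

import numpy as np
import pandas as pd

# Passing a single dtype to the constructor applies it to every column
df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]}, dtype=np.float64)
print df.dtypes
# Both 'a' and 'b' come out as float64, even though 'a' started as integers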

I added some lines to method2() in my code to change the dtypes column-by-column to match the input dataframes. This operation is expensive and negates the benefit of having the appropriate dtypes when writing blocks of rows.

EDIT_2: Tried preallocating a dataframe using a placeholder array of np.empty(..., dtype=some_type) values, per @joris's suggestion (see method3() above).

andrew
  • Interesting question... – Matt Jul 29 '15 at 02:09
  • For large csv files, look into the [pandas csv parser](http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html) with chunksize as an argument. – Parfait Jul 29 '15 at 02:22
  • I thought about `read_csv` chunksize. I've profiled my actual application enough to see that parsing is not the bottleneck. The files are not that "large", only about 60k rows each. The bottleneck is building the big dataframe. – andrew Jul 29 '15 at 02:31
  • You can use ``np.empty`` calls with the correct dtypes to preallocate a dataframe with varying dtypes – joris Jul 29 '15 at 22:12
  • @joris See EDIT_2. It is much faster, but still not faster than append. – andrew Jul 29 '15 at 22:41

4 Answers


Your benchmark is actually too small to show the real difference. Appending copies EACH time, so you are really copying a size-N memory space on the order of N*(N-1) times. This is horribly inefficient as the size of your dataframe grows. It might not matter for a very small frame, but for any real size it matters a lot. This is specifically noted in the docs, though only as a small warning.

In [97]: df = DataFrame(np.random.randn(100000,20))

In [98]: df['B'] = 'foo'

In [99]: df['C'] = pd.Timestamp('20130101')

In [103]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 22 columns):
0     100000 non-null float64
1     100000 non-null float64
2     100000 non-null float64
3     100000 non-null float64
4     100000 non-null float64
5     100000 non-null float64
6     100000 non-null float64
7     100000 non-null float64
8     100000 non-null float64
9     100000 non-null float64
10    100000 non-null float64
11    100000 non-null float64
12    100000 non-null float64
13    100000 non-null float64
14    100000 non-null float64
15    100000 non-null float64
16    100000 non-null float64
17    100000 non-null float64
18    100000 non-null float64
19    100000 non-null float64
B     100000 non-null object
C     100000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(20), object(1)
memory usage: 17.5+ MB

Appending

In [85]: def f1():
   ....:     result = df
   ....:     for i in range(9):
   ....:         result = result.append(df)
   ....:     return result
   ....: 

Concat

In [86]: def f2():
   ....:     result = []
   ....:     for i in range(10):
   ....:         result.append(df)
   ....:     return pd.concat(result)
   ....: 

In [100]: f1().equals(f2())
Out[100]: True

In [101]: %timeit f1()
1 loops, best of 3: 1.66 s per loop

In [102]: %timeit f2()
1 loops, best of 3: 220 ms per loop

Note that I wouldn't even bother trying to pre-allocate. It's somewhat complicated, especially since you are dealing with multiple dtypes (e.g. you could make a giant frame and simply .loc into it and it would work). But pd.concat is just dead simple, works reliably, and is fast.
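Applied to your CSV case, a minimal sketch might look something like this (the glob pattern is just a placeholder for your files):

import glob
import pandas as pd

# Read each file into its own frame, collect the frames in a list,
# and concatenate once at the end
frames = [pd.read_csv(fname) for fname in glob.glob('data/*.csv')]
big_df = pd.concat(frames, ignore_index=True)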

And timing of your sizes from above

In [104]: df = DataFrame(np.random.randn(2500,40))

In [105]: %timeit f1()
10 loops, best of 3: 33.1 ms per loop

In [106]: %timeit f2()
100 loops, best of 3: 4.23 ms per loop
Jeff
  • You are absolutely right. I just finished a run where I increased `num_dfs` to 300. In this case `method3()` is the clear winner. I will write and test a `method4()` that uses `pd.concat`. – andrew Jul 30 '15 at 00:47
  • Are you not trading time complexity for space complexity here? Method two will use twice the space, as you first hold all the data in the list `result` and then copy it into the dataframe. I believe that is also why it is faster: you are not really adding them one at a time. By passing the whole list to the concat function, the underlying backend can probably check how large a dataframe it will require and allocate the space accordingly (though I didn't check the implementation). – user3917718 Mar 27 '20 at 09:46

@Jeff, pd.concat wins by a mile! I benchmarked a fourth method using pd.concat with num_dfs = 500. The results are unequivocal:

The method4() definition:

# Method 4 - use pd.concat on df_list
def method4():
    return pd.concat(df_list, ignore_index=True)

Profiling results, using the same Timer from my original question:

Total time for 100 repetitions of Method 1: 3679.334655 [sec]
Best time: 35.570036
Mean time: 36.793347
Total time for 100 repetitions of Method 2: 1569.917425 [sec]
Best time: 15.457102
Mean time: 15.699174
Total time for 100 repetitions of Method 3: 325.730455 [sec]
Best time: 3.192702
Mean time: 3.257305
Total time for 100 repetitions of Method 4: 25.448473 [sec]
Best time: 0.244309
Mean time: 0.254485

The pd.concat method is 13x faster than preallocating with an np.empty(..., dtype=some_type) placeholder.

andrew

You didn't specify any data or dtype for out_df2, so it ends up with the "object" dtype. This makes assigning values to it very slow. Specify the float64 dtype instead:

out_df2 = pd.DataFrame(columns=np.arange(n_cols), index=np.arange(num_dfs*n_rows), dtype=np.float64)

You will see a dramatic speedup. When I tried it, method2 with this change is about twice as fast as method1.

BrenBarn
  • I also find that your solution makes preallocation the faster method. But I did not accept it as the answer because my problem requires using different dtypes. I edited the question to reflect this. – andrew Jul 29 '15 at 21:29

Jeff's answer is correct, but I found that, for my data, another solution worked better.

def df_(): 
    return pd.DataFrame(['foo']*np.random.randint(100)).transpose()

k = 100
frames = [df_() for x in range(0, k)]

def f1():
    result = frames[0]
    for i in range(k-1):
        result = result.append(frames[i+1])
    return result

def f2():  
    result = []
    for i in range(k):
        result.append(frames[i])
    return pd.concat(result)

def f3():
    result = []
    for i in range(k):
        result.append(frames[i])

    n = 2
    while len(result) > 1:
        _result = []
        for i in range(0, len(result), n):
            _result.append(pd.concat(result[i:i+n]))
        result = _result
    return result[0]

My dataframes are each a single row and of varying length - the null entries must have something to do with why f3() comes out fastest.

In [33]: f1().equals(f2())
Out[33]: True

In [34]: f1().equals(f3())
Out[34]: True

In [35]: %timeit f1()
357 ms ± 192 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [36]: %timeit f2()
562 ms ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [37]: %timeit f3()
215 ms ± 58.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The above results are still for k=100, but for larger k the difference is even more significant.

Tom