I have 14 data frames, each with 14 columns and more than 250,000 rows. The data frames have identical column headers, and I would like to merge them row-wise. I attempted to concatenate the data frames onto a 'growing' DataFrame, and it's taking several hours.
Essentially, I was concatenating 13 times, like this:

DF = pd.DataFrame()
for i in range(13):
    DF = pd.concat([DF, subDF])  # subDF is the next data frame to append
The Stack Overflow answer here suggests appending all of the sub data frames to a list and then concatenating the list of sub data frames.
That sounds like doing something like this:
DF = pd.DataFrame()
lst = [subDF, subDF, subDF, ..., subDF]  # up to 13 times
for subDF in lst:
    DF = pd.concat([DF, subDF])
Aren't they the same thing? Perhaps I'm misunderstanding the suggested workflow. Here's what I tested.
import numpy
import pandas as pd
import timeit
def test1():
    "make all subDFs and then concatenate them"
    numpy.random.seed(1)
    subDF = pd.DataFrame(numpy.random.rand(1))
    lst = [subDF, subDF, subDF]
    DF = pd.DataFrame()
    for subDF in lst:
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

def test2():
    "add each subDF to the collecting DF as you're making the subDFs"
    numpy.random.seed(1)
    DF = pd.DataFrame()
    for i in range(3):
        subDF = pd.DataFrame(numpy.random.rand(1))
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

print('test1() takes {0} sec'.format(timeit.timeit(test1, number=1000)))
print('test2() takes {0} sec'.format(timeit.timeit(test2, number=1000)))
>> Output
test1() takes 12.732409087137057 sec
test2() takes 15.097430311612698 sec
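For what it's worth, maybe the answer actually means making a single pd.concat call on the whole list, rather than calling pd.concat inside a loop. A minimal sketch of that reading, which I haven't timed yet (the test3 name is mine, not from the answer):

def test3():
    "collect all subDFs in a list, then concatenate with one call"
    numpy.random.seed(1)
    lst = [pd.DataFrame(numpy.random.rand(1)) for i in range(3)]
    DF = pd.concat(lst, axis=0, ignore_index=True)  # single call, no growing DF

If that single-call version is the intended workflow, is that where the speedup is supposed to come from?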
I would appreciate your suggestions on efficient ways to concatenate multiple large data frames row-wise. Thanks!