I have 14 data frames, each with 14 columns and more than 250,000 rows. The data frames have identical column headers, and I would like to merge them row-wise. I attempted to concatenate the data frames onto a 'growing' DataFrame, and it's taking several hours.
Essentially, I was concatenating 13 times, like this:

DF = pd.DataFrame()
for i in range(13):
    DF = pd.concat([DF, subDF])  # subDF is the next data frame to append
The Stack Overflow answer here suggests appending all of the sub data frames to a list and then concatenating the list of sub data frames.
That sounds like doing something like this:
DF = pd.DataFrame()
lst = [subDF, subDF, subDF, ..., subDF]  # up to 13 times
for subDF in lst:
    DF = pd.concat([DF, subDF])
Aren't they the same thing? Perhaps I'm misunderstanding the suggested workflow. Here's what I tested.
import numpy
import pandas as pd
import timeit
def test1():
    "make all subDFs and then concatenate them"
    numpy.random.seed(1)
    subDF = pd.DataFrame(numpy.random.rand(1))
    lst = [subDF, subDF, subDF]
    DF = pd.DataFrame()
    for subDF in lst:
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

def test2():
    "add each subDF to the collecting DF as you're making the subDFs"
    numpy.random.seed(1)
    DF = pd.DataFrame()
    for i in range(3):
        subDF = pd.DataFrame(numpy.random.rand(1))
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

print('test1() takes {0} sec'.format(timeit.timeit(test1, number=1000)))
print('test2() takes {0} sec'.format(timeit.timeit(test2, number=1000)))
>> Output
test1() takes 12.732409087137057 sec
test2() takes 15.097430311612698 sec
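For what it's worth, maybe the answer actually means making a single pd.concat call on the whole list, rather than calling pd.concat inside a loop. A minimal sketch of that reading, which I haven't timed yet (the test3 name is mine, not from the answer):

def test3():
    "collect all subDFs in a list, then concatenate with one call"
    numpy.random.seed(1)
    lst = [pd.DataFrame(numpy.random.rand(1)) for i in range(3)]
    DF = pd.concat(lst, axis=0, ignore_index=True)  # single call, no growing DF

If that single-call version is the intended workflow, is that where the speedup is supposed to come from?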
I would appreciate your suggestions on efficient ways to concatenate multiple large data frames row-wise. Thanks!