To simulate your API data retrieval, I've created a small toy function that gives us a row of data:
import numpy as np
import pandas as pd

def createRow(i):
    # One fake "API row": a string key, a random int, a random float, and a constant.
    return ["test" + str(i), np.random.randint(0, 1000), np.random.randn(), 13245648]
So let's compare our options:
First, let's get the append out of the way, since it's painfully slow:
dtypes = np.dtype(
    [
        ("a", str),
        ("b", int),
        ("c", float),
        ("d", float),
    ]
)
df: pd.DataFrame = pd.DataFrame(np.empty(0, dtype=dtypes))
for i in range(10000):
    df.loc[i] = createRow(i)
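The timing tables below are standard cProfile output. The original harness isn't shown, so here's a minimal sketch of how such a table can be produced (build_with_loc is my own illustrative name):

import cProfile

def build_with_loc():
    df = pd.DataFrame(np.empty(0, dtype=dtypes))
    for i in range(10000):
        df.loc[i] = createRow(i)

# Sort by cumulative time to see where the seconds actually go.
cProfile.run("build_with_loc()", sort="cumtime")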
The profiler complains about this taking quite some time, ~34 seconds:
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.124 0.000 34.290 0.003 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\indexing.py:831(__setitem__)
10000 0.112 0.000 32.339 0.003 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\indexing.py:1689(_setitem_with_indexer)
10000 0.409 0.000 32.187 0.003 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\indexing.py:2080(_setitem_with_indexer_missing)
This answer gives us the reason: apparently, this sort of append creates a new dataframe at every step, which is what makes it so slow.
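If you really must fill a dataframe in place, one workaround (my sketch, not part of the original benchmark) is to preallocate the full index up front and assign by position, so pandas never has to grow the frame:

n = 10000
# Preallocate every row up front; .iloc then writes into existing rows
# instead of hitting the expensive missing-indexer path on each append.
df = pd.DataFrame(index=range(n), columns=["a", "b", "c", "d"])
for i in range(n):
    df.iloc[i] = createRow(i)

Per-row .iloc assignment still carries pandas overhead, so expect this to beat the growing append but not the list approach shown further down.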
Concat is even worse, at ~56 seconds:
df: pd.DataFrame = pd.DataFrame(np.empty(0, dtype=dtypes))
for i in range(10000):
    # Wrap the row in a list so it becomes one row (not one column),
    # and assign the result back - concat does not work in place.
    df = pd.concat([df, pd.DataFrame([createRow(i)])])
Nothing too surprising, since it creates a new dataframe at every step, wraps both the old and the new dataframe into a list, and then concatenates them. Ouch.
(Yes, pd.concat requires a collection of dataframes to work with. yay.)
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.215 0.000 56.281 0.006 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\reshape\concat.py:149(concat)
10000 0.742 0.000 37.292 0.004 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\reshape\concat.py:565(get_result)
10000 0.693 0.000 22.621 0.002 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\internals\concat.py:176(concatenate_managers)
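If you want to stick with concat, the usual fix is to call it once, outside the loop, over a list of per-row frames; a sketch of mine, not from the original benchmark:

frames = []
for i in range(10000):
    frames.append(pd.DataFrame([createRow(i)]))
# A single concat allocates the result exactly once.
df = pd.concat(frames, ignore_index=True)

That turns the quadratic copying into a single pass, although building thousands of one-row dataframes is still more expensive than building plain lists.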
The recommended way is to build a plain list of rows, as you're already doing, and then convert it to a dataframe in one go:
l = []
for i in range(1000000):
    l.append(createRow(i))
df = pd.DataFrame(l, columns=['a', 'b', 'c', 'd'])
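One caveat: built this way, pandas infers the dtypes itself. If you want the explicit dtypes from earlier, you can cast after the cheap bulk construction (a sketch; the mapping just mirrors the dtypes defined above):

# Cast the numeric columns explicitly once the frame exists.
df = df.astype({"b": "int64", "c": "float64", "d": "float64"})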
Suddenly, the bottleneck becomes the API (or, in our case, the generating function):
Note that I've increased the number of iterations 100-fold, since this approach is so fast that 10000 iterations weren't enough to show meaningful values. (Both append and concat are quadratic time-wise; this one should be linear.)
ncalls tottime percall cumtime percall filename:lineno(function)
1000000 1.830 0.000 10.026 0.000 C:\Users\\PycharmProjects\playground\main.py:7(createRow)
1000000 6.142 0.000 6.142 0.000 {method 'randint' of 'numpy.random.mtrand.RandomState' objects}
1000000 2.054 0.000 2.054 0.000 {method 'randn' of 'numpy.random.mtrand.RandomState' objects}
2 0.000 0.000 0.979 0.489 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\frame.py:641(__init__)
2 0.000 0.000 0.938 0.469 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\internals\construction.py:775(to_arrays)
1 0.000 0.000 0.937 0.937 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\internals\construction.py:484(nested_data_to_arrays)
So this should answer your question. Also, since the API is now the bottleneck, it would be over-optimization to try to reduce the time any further - unless for some reason your API is so fast that this doesn't happen.
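As an aside, in this toy setup (and only here, since a real API can't be vectorized away) even the generation step could be done in bulk with numpy; a sketch:

n = 1_000_000
# Generate whole columns at once instead of one Python call per row.
df = pd.DataFrame({
    "a": ["test" + str(i) for i in range(n)],
    "b": np.random.randint(0, 1000, size=n),
    "c": np.random.randn(n),
    "d": np.full(n, 13245648),
})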