
I am downloading data from an API and using it to create a pandas DataFrame. I download the data row by row, and will often download over a million rows. Currently I create a list, append every row to it, then convert the full list into a DataFrame at the end (see example code below). Would it be faster to create a DataFrame and use the append or concat methods with each row, or is there another trick I might be missing? I also thought about using a numpy ndarray that I would then convert into a DataFrame. It feels like, the list being such a basic object, my approach should be faster, but is this reasoning right?

data_list = []
requested_data = (API CALL HERE)
for x in requested_data:
    data_list.append(x)
df = pd.DataFrame(data_list)
Eboyer
    do not use **list** as variable name – RomanPerekhrest Aug 07 '23 at 09:23
  • About the above comment: https://docs.python.org/3/tutorial/datastructures.html *list* is a Python data type with methods attached. It isn't wise to assign it as a variable name. Choose something other than this exact word. – OCa Aug 07 '23 at 10:09
  • What is the type of requested_data? What I find a little confusing is that it looks like an iterable, but you still make a list out of it. This is possibly a redundant operation. Could you give an example, or a significant part of it? Yes, the concatenation is usually advantageous, but we need more details to help. – OCa Aug 07 '23 at 10:23

1 Answer


Let's benchmark:

To simulate your API data retrieval, I've created a small toy function that gives us one row of data.

import numpy as np
import pandas as pd

def createRow(i):
    return ["test" + str(i), np.random.randint(0, 1000), np.random.randn(), 13245648]

So let's compare our options. First, let's get row-by-row insertion with .loc (the "append" approach) out of the way, since it's painfully slow:

dtypes = np.dtype(
    [
        ("a", str),
        ("b", int),
        ("c", float),
        ("d", float),
    ]
)
df: pd.DataFrame = pd.DataFrame(np.empty(0, dtype=dtypes))
for i in range(10000):
    df.loc[i] = createRow(i)

The profiler shows this takes quite some time, ~34 seconds:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
10000    0.124    0.000   34.290    0.003 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\indexing.py:831(__setitem__)
10000    0.112    0.000   32.339    0.003 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\indexing.py:1689(_setitem_with_indexer)
10000    0.409    0.000   32.187    0.003 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\indexing.py:2080(_setitem_with_indexer_missing)

This answer gives us the reason: apparently, this sort of append creates a new dataframe at every step, which is bad.

Concat is even worse, with ~56 seconds:

df: pd.DataFrame = pd.DataFrame(np.empty(0, dtype=dtypes))
for i in range(10000):
    df = pd.concat([df, pd.DataFrame([createRow(i)])])

Nothing too surprising, since it creates a new dataframe at every step, wraps both the old and the new dataframe into a list, and then concatenates them. Ouch. (Yes, pd.concat requires a collection of dataframes to work with. Yay.)

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
10000    0.215    0.000   56.281    0.006 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\reshape\concat.py:149(concat)
10000    0.742    0.000   37.292    0.004 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\reshape\concat.py:565(get_result)
10000    0.693    0.000   22.621    0.002 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\internals\concat.py:176(concatenate_managers)
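If you do want to stick with concat, the usual fix (not what the loop above does) is to collect the per-row frames in a list and call pd.concat once after the loop. A minimal sketch, reusing the createRow helper from above; the column names are made up for illustration:

# Build all the tiny one-row frames first, then concatenate once.
# This avoids the quadratic copying, but still constructs 10000 small
# DataFrames, so the plain list-of-rows approach below is usually simpler.
frames = [pd.DataFrame([createRow(i)], columns=["a", "b", "c", "d"]) for i in range(10000)]
df = pd.concat(frames, ignore_index=True)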

The recommended way is to create the list as you're already doing, then convert it to a dataframe.

rows = []
for i in range(1000000):
    rows.append(createRow(i))
df = pd.DataFrame(rows, columns=['a', 'b', 'c', 'd'])

Suddenly, the bottleneck becomes the API (or, in our case, the generating function). Note that I've increased the number of iterations 100-fold, since this approach is so fast that 10,000 iterations weren't enough to see any meaningful values. (Both the append and the concat versions are quadratic in time; this one should be linear.)

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    1.830    0.000   10.026    0.000 C:\Users\\PycharmProjects\playground\main.py:7(createRow)
  1000000    6.142    0.000    6.142    0.000 {method 'randint' of 'numpy.random.mtrand.RandomState' objects}
  1000000    2.054    0.000    2.054    0.000 {method 'randn' of 'numpy.random.mtrand.RandomState' objects}
        2    0.000    0.000    0.979    0.489 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\frame.py:641(__init__)
        2    0.000    0.000    0.938    0.469 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\internals\construction.py:775(to_arrays)
        1    0.000    0.000    0.937    0.937 C:\Users\\PycharmProjects\playground\venv\lib\site-packages\pandas\core\internals\construction.py:484(nested_data_to_arrays)
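If you need specific dtypes (like the dtypes defined for the first benchmark), one option is to convert after construction. A small sketch, assuming the column names used above and a reasonably recent pandas version:

# Optional: enforce dtypes after building the frame in one go.
df = df.astype({"a": "string", "b": "int64", "c": "float64", "d": "int64"})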

So this should answer your question. Also, since the API now becomes the bottleneck, it would be over-optimization to try to reduce the time any further - unless for some reason your API is so fast that this doesn't happen.
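Applied to your original loop, the pattern looks something like the sketch below (requested_data is the placeholder from your post, and the column names are made up for illustration). If your API already hands you an iterable of rows, you can often pass it to pd.DataFrame directly and skip the intermediate list:

import pandas as pd

rows = []
for x in requested_data:  # requested_data: whatever your API call returns
    rows.append(x)

# Single conversion at the end instead of one append/concat per row.
df = pd.DataFrame(rows, columns=["a", "b", "c", "d"])  # column names are placeholders

# If requested_data is a plain iterable of row sequences, this may work directly:
# df = pd.DataFrame(requested_data, columns=["a", "b", "c", "d"])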

Shamis