
I found that initializing a pandas Series object from a list of DataFrames is extremely slow. For example, the following code:

import pandas as pd
import numpy as np

# creating a large (~8GB) list of DataFrames.
l = [pd.DataFrame(np.zeros((1000, 1000))) for i in range(1000)]

# This line executes extremely slowly and takes almost ~10GB of extra memory. Why?
# It is even much, much slower than constructing the original list `l`.
s = pd.Series(l)

Initially I thought the Series initialization accidentally deep-copied the DataFrames, which would make it slow, but it turned out that it just copies by reference, as the usual `=` in Python does.

On the other hand, if I create the Series first and manually shallow-copy the elements over in a for loop, it is fast:

# This for loop is faster. Why?
s1 = pd.Series(data=None, index=range(1000), dtype=object)
for i in range(1000):
    s1[i] = l[i]

What is happening here?
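
For what it's worth, here is a small timing sketch that just puts the two constructions above side by side (the list is scaled down so it runs quickly; names and sizes are purely illustrative):

import time

import numpy as np
import pandas as pd

frames = [pd.DataFrame(np.zeros((1000, 1000))) for _ in range(100)]

t0 = time.perf_counter()
s_direct = pd.Series(frames)            # the slow construction from a list
t1 = time.perf_counter()

s_loop = pd.Series(data=None, index=range(len(frames)), dtype=object)
for i in range(len(frames)):            # the fast manual fill
    s_loop[i] = frames[i]
t2 = time.perf_counter()

print(f"pd.Series(list of DataFrames): {t1 - t0:.2f} s")
print(f"manual fill of object Series:  {t2 - t1:.2f} s")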

Real-life usage: I have a table loader which reads something on disk and returns a pandas DataFrame (a table). To speed up the reading, I use a parallel tool (from this answer) to execute multiple reads (each read is for one date, for example), and it returns a list (of tables). Now I want to transform this list into a pandas Series object with a proper index (e.g. the date or file location used in the read), but the Series construction takes a ridiculous amount of time (as the sample code above shows). I can of course write it as a for loop to solve the issue, but that'll be ugly. Besides, I want to know what is really taking the time here. Any insights?
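
For completeness, this is roughly the shape of the for-loop workaround I have in mind; the `dates` values below are made up, just to illustrate attaching a proper index afterwards:

dates = pd.date_range("2021-01-01", periods=len(l), freq="D")  # hypothetical: one label per read

s2 = pd.Series(data=None, index=range(len(l)), dtype=object)
for i in range(len(l)):
    s2[i] = l[i]
s2.index = dates  # swap in the real index (dates / file locations) afterwards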

Fei Liu
  • I cannot reproduce `pd.Series(l)` taking too long (on Google Colab with `i in range(600)` due to limited RAM). Perhaps your machine starts using swap memory? – hilberts_drinking_problem Dec 30 '21 at 03:44
  • @hilberts_drinking_problem: Strange indeed. I tried on Google Colab as well and it aligns with what you found: `pd.Series(l)` doesn't seem to have a problem there. I also suspected it's related to the pandas version (mine is 1.3.5 and the Google Colab runtime I tried was using 1.1.5). However, after I switched to 1.1.5 (the one Google Colab used), the issue still occurs (on my MacBook Pro). This is bizarre. Also I don't think it's related to swap memory: even if I downsize it to `i in range(500)` on my Mac, I still see it happening (and the memory is abundant). – Fei Liu Dec 30 '21 at 15:34

1 Answer


This is not a direct answer to the OP's question (what's causing the slow-down when constructing a series from a list of dataframes):

I might be missing an important advantage of using pd.Series to store a list of dataframes; however, if that's not critical for downstream processes, a better option might be to store them as a dictionary of dataframes or to concatenate them into a single dataframe.

For the dictionary of dataframes, one could use something like:

d = {n: df for n, df in enumerate(l)}
# can change the key to something more useful in downstream processes
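
For instance, if each dataframe corresponds to a date (as in the question's real-life usage), the keys could be those dates instead; the `dates` values here are purely illustrative:

dates = pd.date_range("2021-01-01", periods=len(l))  # hypothetical keys
d = {date: df for date, df in zip(dates, l)}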

For concatenation:

w = pd.concat(l, axis=1)
# note: with the snippet in this question the column names will be
# duplicated (every dataframe has the same column names), but if your
# actual list of dataframes has unique column names, the concatenated
# result will act as a normal dataframe with unique column names
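
If you need to keep track of which read each block came from, pd.concat also accepts a keys argument, which builds a hierarchical index on the concatenation axis from labels you supply (again, the dates below are placeholders):

dates = pd.date_range("2021-01-01", periods=len(l))  # hypothetical labels, e.g. one per read
w2 = pd.concat(l, axis=1, keys=dates)
# w2.columns is now a MultiIndex of (date, original column name),
# so w2[dates[0]] recovers the first dataframe in the list
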
SultanOrazbayev