I found that initializing a pandas Series object from a list of DataFrames is extremely slow. For example, consider the following code:
import pandas as pd
import numpy as np
# creating a large (~8GB) list of DataFrames.
l = [pd.DataFrame(np.zeros((1000, 1000))) for i in range(1000)]
# This line executes extremely slowly and takes roughly an extra 10GB of memory. Why?
# It is even much, much slower than constructing the original list `l`.
s = pd.Series(l)
Initially I thought the Series initialization accidentally deep-copied the DataFrames, which would make it slow, but it turned out that it just copies by reference, as the usual = in Python does.
On the other hand, if I just create a Series and manually shallow-copy the elements over (in a for loop), it is fast:
# This for loop is faster. Why?
s1 = pd.Series(data=None, index=range(1000), dtype=object)
for i in range(1000):
    s1[i] = l[i]
What is happening here?
Real-life usage: I have a table loader which reads something on disk and returns a pandas DataFrame (a table). To speed up the reading, I use a parallel tool (from this answer) to execute multiple reads (for example, one read per date), and it returns a list of tables. Now I want to transform this list into a pandas Series object with a proper index (e.g. the date or file location used in the read), but the Series construction takes a ridiculous amount of time (as the sample code above shows). I can of course write it as a for loop to work around the issue, but that would be ugly. Besides, I want to know what is really taking the time here. Any insights?
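For reference, here is a minimal sketch of the kind of workaround I have in mind. It hides the loop in a helper that fills a pre-allocated NumPy object array and only then builds the Series. The helper name `frames_to_series`, the date keys, and the small 4x4 frames are made up for illustration. I am assuming, not asserting, that handing pandas a ready-made 1-D object array sidesteps whatever the list constructor is doing.

```python
import numpy as np
import pandas as pd


def frames_to_series(frames, index=None):
    """Wrap a list of DataFrames in an object-dtype Series (hypothetical helper)."""
    # Pre-allocate a 1-D object array; each slot will hold one DataFrame.
    boxed = np.empty(len(frames), dtype=object)
    for i, frame in enumerate(frames):
        # Single-element assignment stores a reference to the DataFrame,
        # it does not copy the underlying data.
        boxed[i] = frame
    return pd.Series(boxed, index=index)


# Small stand-ins for the tables returned by the loader (illustrative only).
keys = ["2020-01-01", "2020-01-02", "2020-01-03"]
frames = [pd.DataFrame(np.zeros((4, 4))) for _ in keys]
series = frames_to_series(frames, index=keys)
```

With this, `series.iloc[0] is frames[0]` can be used to check that the Series holds the original objects rather than copies. It still does not explain where the time goes in `pd.Series(l)`.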