I have a large list of `pandas.Series` that I would like to merge into a single DataFrame. The list is produced asynchronously via `multiprocessing.imap_unordered`, with new `pd.Series` objects coming in every few seconds. My current approach is to call `pd.DataFrame` on the list of series as follows:
```python
timeseries_lst = []
counter = 0
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    if counter % 500 == 0:
        logger.debug(f"Finished creating timeseries for {counter} out of {nr_jobs}")
    counter += 1
    timeseries_lst.append(timeseries)
timeseries_df = pd.DataFrame(timeseries_lst)
```
The problem is that during the last line my available RAM is completely used up (the process dies with exit code 137, i.e. it gets killed). Unfortunately I cannot provide a runnable example, because the data is several hundred GB in size. Increasing the swap space is not a feasible option: the available RAM is already quite large (about 1 TB), and a bit of swap is not going to make much of a difference.
My idea is that one could, at regular intervals of maybe 500 iterations, add the new series to a growing DataFrame. This would allow clearing `timeseries_lst` and thereby reduce RAM usage. My question is: what is the most efficient way to do so? The options that I can think of are:
- Create small DataFrames from the new data and concatenate them onto the growing DataFrame (sketched below)
- Concatenate the growing DataFrame and the new series directly
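For illustration, here is roughly what I mean by the periodic flush, sketched with the first option (`pool`, `CPU_intensive_function` and `args` are the same names as in the snippet above; `chunk_size` is a placeholder value):

```python
import pandas as pd

chunk_size = 500                       # flush interval (placeholder value)
timeseries_df = None                   # the growing result
buffer = []                            # holds at most chunk_size Series

for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    buffer.append(timeseries)
    if len(buffer) >= chunk_size:
        block = pd.DataFrame(buffer)   # one row per Series, as before
        timeseries_df = (
            block if timeseries_df is None
            else pd.concat([timeseries_df, block])
        )
        buffer.clear()                 # free the buffered Series

if buffer:                             # flush whatever is left over
    block = pd.DataFrame(buffer)
    timeseries_df = (
        block if timeseries_df is None
        else pd.concat([timeseries_df, block])
    )
```

(I am aware that each `pd.concat` returns a new copy of the growing DataFrame, which is exactly why I am unsure which variant is more efficient.)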
Does anybody know which of these two would be more efficient, or have a better idea? I have seen this answer, but it would not really reduce RAM usage, since the small DataFrames still need to be held in memory.
Thanks a lot!
Edit: Thanks to Timus, I am one step further. Pandas uses the following code when creating a DataFrame:
```python
elif is_list_like(data):
    if not isinstance(data, (abc.Sequence, ExtensionArray)):
        data = list(data)  # <-- We don't want this
```
So what would a generator function have to look like to be considered an instance of either `abc.Sequence` or `ExtensionArray`? Thanks!
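To make the check concrete, here is a small experiment I put together (toy code, not pandas internals). A plain generator fails the `isinstance` test, while a class that subclasses `abc.Sequence` and implements `__getitem__`/`__len__` passes it; but such a class needs random access and a length, which a one-shot generator cannot offer without materializing its output anyway:

```python
from collections import abc

def gen():
    yield from range(3)

# Generators do not implement __getitem__/__len__, so the check fails:
print(isinstance(gen(), abc.Sequence))                   # False

class SeqWrapper(abc.Sequence):
    """Toy wrapper that satisfies the abc.Sequence check."""
    def __init__(self, items):
        self._items = items              # already-materialized data
    def __getitem__(self, i):
        return self._items[i]
    def __len__(self):
        return len(self._items)

print(isinstance(SeqWrapper([1, 2, 3]), abc.Sequence))   # True
```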