
I have a large list of pandas.Series that I would like to merge into a single DataFrame. The list is filled asynchronously via multiprocessing.Pool.imap_unordered, with new pd.Series objects arriving every few seconds. My current approach is to call pd.DataFrame on the list of series, as follows:

import pandas as pd

# pool, args, logger and nr_jobs are set up elsewhere
timeseries_lst = []
counter = 0
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    if counter % 500 == 0:
        logger.debug(f"Finished creating timeseries for {counter} out of {nr_jobs}")
    counter += 1
    timeseries_lst.append(timeseries)
timeseries_df = pd.DataFrame(timeseries_lst)

The problem is that during the last line, all my available RAM is used up and the process is killed with exit code 137 (out of memory). Unfortunately I cannot provide a runnable example, because the data is several hundred GB in size. Increasing swap memory is not a feasible option: the available RAM is already quite large (about 1 TB), and a bit of swap is not going to make much of a difference.

My idea is to add the new series to a growing DataFrame at regular intervals of maybe 500 iterations. This would allow clearing timeseries_lst and thereby reduce RAM usage. My question, however, is: what is the most efficient way to do so? The options I can think of are:

  • Create small dataframes with the new data and merge into the growing dataframe
  • Concat the growing dataframe and the new series

Does anybody know which of these two would be more efficient? Or does anybody have a better idea? I have seen this answer, but it would not really reduce RAM usage, since the small dataframes still need to be held in memory.
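
For illustration, here is a minimal sketch of the second option, flushing a buffer of 500 series into the growing frame via pd.concat (the batch size is arbitrary; pool, args and CPU_intensive_function are as above):

import pandas as pd

BATCH_SIZE = 500  # flush interval, chosen arbitrarily

timeseries_df = None
buffer = []
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    buffer.append(timeseries)
    if len(buffer) == BATCH_SIZE:
        new_rows = pd.DataFrame(buffer)  # each Series becomes one row
        timeseries_df = new_rows if timeseries_df is None else pd.concat([timeseries_df, new_rows])
        buffer.clear()
if buffer:  # flush the remainder
    new_rows = pd.DataFrame(buffer)
    timeseries_df = new_rows if timeseries_df is None else pd.concat([timeseries_df, new_rows])

Note that pd.concat copies its inputs, so each flush temporarily needs memory for both the old frame and the combined one.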

Thanks a lot!

Edit: Thanks to Timus, I am one step further. Pandas uses the following code when creating a DataFrame:

        elif is_list_like(data):
            if not isinstance(data, (abc.Sequence, ExtensionArray)):
                data = list(data)  # <-- We don't want this

So what would a generator function have to look like to be considered an instance of either abc.Sequence or ExtensionArray?
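
For reference, a quick check (using pandas.api.types.is_list_like) shows why a plain generator falls through to the list(data) branch:

from collections import abc
from pandas.api.types import is_list_like

def gen():
    yield from range(3)

g = gen()
print(is_list_like(g))              # True: generators are iterable
print(isinstance(g, abc.Sequence))  # False, so pandas calls list(data)

Since abc.Sequence requires __len__ and __getitem__ (i.e. random access), a lazy generator cannot qualify without materializing its values first. Thanks!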

C Hecht

  • I don't think you need the list at all: `timeseries_df = pd.DataFrame(pool.imap_unordered(CPU_intensive_function, args))` should do the same? – Timus Dec 09 '21 at 09:22
  • True, but that wouldn't solve the issue, for two reasons: the full iterable would still be generated and passed as an argument to DataFrame, so the series would exist twice. And I used imap_unordered to get some intermediate feedback on how far along the script is; I would ideally like to keep that... – C Hecht Dec 09 '21 at 09:36
  • 1
    Re _The series would therefore exist twice_: Only one at a time. Re your 2. point: Just wrap it into a generator (function). – Timus Dec 09 '21 at 09:39
  • That is indeed a good trick. Checking the way pandas parses the iterable, the generator would have to pass is_list_like and also be one of abc.Sequence or ExtensionArray. If it is not one of the latter, pandas calls list(data) internally. Do you know what a generator function would have to fulfil to be classified as such? – C Hecht Dec 09 '21 at 10:07
  • 1
    I'd probably just spill to disk (eg. into an SQLite database...), then read the final dataframe back in. – AKX Dec 09 '21 at 10:13
  • 1
    I'd try a function that takes an iterator as argument (you'll give it `pool.imap_unordered(CPU_intensive_function, args)`), loops over it like you do in the `for`-loop, but then does `yield timeseries` instead of `timeseries_lst.append(timeseries)`. – Timus Dec 09 '21 at 11:30
  • @Timus: Unfortunately, pandas will internally call the list function on the provided iterator and thereby create the full dataset in memory. I will see whether there is a way around the issue, but that will certainly be tricky – C Hecht Dec 09 '21 at 14:06
  • I see - sorry for wasting your time :( – Timus Dec 09 '21 at 14:12
  • No worries, SO is about finding a solution together ;-) – C Hecht Dec 09 '21 at 15:30
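
Following AKX's suggestion above, here is a rough sketch of the spill-to-disk idea, assuming a local SQLite file (the table name ts and the batch size of 500 are arbitrary; DataFrame.to_sql accepts a plain sqlite3.Connection):

import sqlite3
import pandas as pd

con = sqlite3.connect("timeseries.db")  # hypothetical on-disk spill file
buffer = []
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    buffer.append(timeseries)
    if len(buffer) == 500:
        # append this batch to the on-disk table, then free it
        pd.DataFrame(buffer).to_sql("ts", con, if_exists="append")
        buffer.clear()
if buffer:  # flush the remainder
    pd.DataFrame(buffer).to_sql("ts", con, if_exists="append")
timeseries_df = pd.read_sql("SELECT * FROM ts", con)  # read back once at the end

The final read still needs RAM for one full copy of the data, but the intermediate duplication during construction is avoided.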
