
I have a large list of pandas.Series that I would like to merge into a single DataFrame. The list is filled asynchronously via multiprocessing.Pool.imap_unordered, with new pd.Series objects arriving every few seconds. My current approach is to call pd.DataFrame on the list of series, as follows:

import pandas as pd

# pool, args, logger and nr_jobs are set up elsewhere
timeseries_lst = []
counter = 0
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    if counter % 500 == 0:
        logger.debug(f"Finished creating timeseries for {counter} out of {nr_jobs}")
    counter += 1
    timeseries_lst.append(timeseries)
timeseries_df = pd.DataFrame(timeseries_lst)

The problem is that during the last line, all my available RAM is used up and the process is killed with exit code 137 (out of memory). Unfortunately I cannot provide a runnable example, because the data is several hundred GB in size. Increasing swap memory is not a feasible option: the available RAM is already quite large (about 1 TB), and a bit of swap is not going to make much of a difference.

My idea is to add the new series to a growing DataFrame at regular intervals of maybe 500 iterations. This would allow clearing timeseries_lst and thereby reduce RAM usage. My question, however, is: what is the most efficient way to do so? The options I can think of are:

  • Create small dataframes with the new data and merge into the growing dataframe
  • Concat the growing dataframe and the new series

Does anybody know which of these two would be more efficient? Or does anybody have a better idea? I have seen this answer, but it would not really reduce RAM usage, since the small dataframes still need to be held in memory.
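
For illustration, here is a minimal sketch of the second option, flushing a buffer of 500 series into the growing frame via pd.concat (the batch size is arbitrary; pool, args and CPU_intensive_function are as above):

import pandas as pd

BATCH_SIZE = 500  # flush interval, chosen arbitrarily

timeseries_df = None
buffer = []
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    buffer.append(timeseries)
    if len(buffer) == BATCH_SIZE:
        new_rows = pd.DataFrame(buffer)  # each Series becomes one row
        timeseries_df = new_rows if timeseries_df is None else pd.concat([timeseries_df, new_rows])
        buffer.clear()
if buffer:  # flush the remainder
    new_rows = pd.DataFrame(buffer)
    timeseries_df = new_rows if timeseries_df is None else pd.concat([timeseries_df, new_rows])

Note that pd.concat copies its inputs, so each flush temporarily needs memory for both the old frame and the combined one.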

Thanks a lot!

Edit: Thanks to Timus, I am one step further. Pandas uses the following code when creating a DataFrame:

        elif is_list_like(data):
            if not isinstance(data, (abc.Sequence, ExtensionArray)):
                data = list(data)  # <-- We don't want this

So what would a generator function have to look like to be considered an instance of either abc.Sequence or ExtensionArray?
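
For reference, a quick check (using pandas.api.types.is_list_like) shows why a plain generator falls through to the list(data) branch:

from collections import abc
from pandas.api.types import is_list_like

def gen():
    yield from range(3)

g = gen()
print(is_list_like(g))              # True: generators are iterable
print(isinstance(g, abc.Sequence))  # False, so pandas calls list(data)

Since abc.Sequence requires __len__ and __getitem__ (i.e. random access), a lazy generator cannot qualify without materializing its values first. Thanks!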

C Hecht

  • I don't think you need the list at all: `timeseries_df = pd.DataFrame(pool.imap_unordered(CPU_intensive_function, args))` should do the same? – Timus Dec 09 '21 at 09:22
  • True, but that wouldn't solve the issue, for two reasons: the full iterable would still be generated and passed as an argument to DataFrame, so the series would exist twice. And I used imap_unordered to get some intermediate feedback on how far along the script is; I would ideally like to keep that... – C Hecht Dec 09 '21 at 09:36
  • 1
    Re _The series would therefore exist twice_: Only one at a time. Re your 2. point: Just wrap it into a generator (function). – Timus Dec 09 '21 at 09:39
  • That is indeed a good trick. Checking the way pandas parses the iterable, the generator would have to pass is_list_like and also be one of abc.Sequence or ExtensionArray. If it is not one of the latter, pandas calls list(data) internally. Do you know what a generator function would have to fulfil to be classified as such? – C Hecht Dec 09 '21 at 10:07
  • 1
    I'd probably just spill to disk (eg. into an SQLite database...), then read the final dataframe back in. – AKX Dec 09 '21 at 10:13
  • 1
    I'd try a function that takes an iterator as argument (you'll give it `pool.imap_unordered(CPU_intensive_function, args)`), loops over it like you do in the `for`-loop, but then does `yield timeseries` instead of `timeseries_lst.append(timeseries)`. – Timus Dec 09 '21 at 11:30
  • @Timus: Unfortunately, pandas will internally call the list function on the provided iterator and thereby create the full dataset in memory. I will see whether there is a way around the issue, but that will certainly be tricky – C Hecht Dec 09 '21 at 14:06
  • I see - sorry for wasting your time :( – Timus Dec 09 '21 at 14:12
  • No worries, SO is about finding a solution together ;-) – C Hecht Dec 09 '21 at 15:30
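
Following AKX's suggestion above, here is a rough sketch of the spill-to-disk idea, assuming a local SQLite file (the table name ts and the batch size of 500 are arbitrary; DataFrame.to_sql accepts a plain sqlite3.Connection):

import sqlite3
import pandas as pd

con = sqlite3.connect("timeseries.db")  # hypothetical on-disk spill file
buffer = []
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    buffer.append(timeseries)
    if len(buffer) == 500:
        # append this batch to the on-disk table, then free it
        pd.DataFrame(buffer).to_sql("ts", con, if_exists="append")
        buffer.clear()
if buffer:  # flush the remainder
    pd.DataFrame(buffer).to_sql("ts", con, if_exists="append")
timeseries_df = pd.read_sql("SELECT * FROM ts", con)  # read back once at the end

The final read still needs RAM for one full copy of the data, but the intermediate duplication during construction is avoided.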
