
My use case appears to be different from the answers suggested for similar questions. I need to iterate over a list of Git repos using the GitPython module, do a shallow clone of each, iterate over each branch, and then run an operation on the contents of each branch. The result of each operation will be captured as a DataFrame with data in specific columns.

It's been suggested that I could use a ThreadPoolExecutor for this: grab the DataFrame produced from each repo's output and then aggregate them into a single DataFrame. I could use to_csv() to create a file for each repo and branch and then aggregate those files when the pool finishes, but I'm wondering if I can skip the CSV files entirely and do it all in memory. Or is it possible for each thread to add rows to a single aggregate DataFrame without overwriting data?

Any feedback on the pros and cons of various approaches would be appreciated.

iwonder

1 Answer


My inclination would also be not to write everything to CSV files; that costs time which can be avoided by keeping the output in memory. Just create a list of DataFrames: each time a worker finishes processing a repo, append its result to the list. Once all DataFrames have been collected, it is easy to merge (actually, concatenate) the list into one DataFrame.
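
A minimal sketch of that pattern, assuming placeholder repo URLs and a hypothetical process_repo() worker whose per-branch operation is left as a stub; clone_from() forwards keyword arguments such as depth and no_single_branch to git clone:

    import concurrent.futures
    import tempfile

    import git          # GitPython
    import pandas as pd

    def process_repo(url):
        """Shallow-clone one repo, visit each remote branch, and return
        the per-branch results for that repo as a single DataFrame."""
        rows = []
        with tempfile.TemporaryDirectory() as tmp:
            # --depth 1 keeps the clone shallow; --no-single-branch still
            # fetches the tips of all branches, not just the default one
            repo = git.Repo.clone_from(url, tmp, depth=1, no_single_branch=True)
            for ref in repo.remotes.origin.refs:
                if ref.remote_head == "HEAD":   # skip the symbolic origin/HEAD
                    continue
                repo.git.checkout(ref.remote_head)
                # ... run the real per-branch operation here ...
                rows.append({"repo": url, "branch": ref.remote_head})
        return pd.DataFrame(rows)

    urls = ["https://example.com/a.git", "https://example.com/b.git"]
    frames = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(process_repo, u) for u in urls]
        for fut in concurrent.futures.as_completed(futures):
            frames.append(fut.result())  # one DataFrame per finished repo

Because each worker returns its own DataFrame and only the main thread appends to the list, no two threads ever write to the same object, which sidesteps the overwriting concern from the question.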

Here is an example of how to merge a list of DataFrames into one:
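
With the frames list from the sketch above, the merge is a single pandas.concat call:

    import pandas as pd

    # frames is the list of per-repo DataFrames collected above
    combined = pd.concat(frames, ignore_index=True)

ignore_index=True rebuilds a clean 0..n-1 index on the combined DataFrame instead of keeping each repo's original row numbers.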

mike