My use case appears to differ from the answers suggested for similar questions. I need to iterate over a list of Git repos using the GitPython module, do a shallow clone of each, iterate over each branch, and run an operation on the contents of each branch. The result of each operation will be captured as a DataFrame with data in specific columns.
It's been suggested that I could use a ThreadPoolExecutor for this: grab the DataFrame produced from each repo's output and then aggregate them into a single DataFrame. I could use to_csv() to create one file per repo and branch and aggregate them when the pool finishes, but I'm wondering whether I can skip the CSV files entirely and do the aggregation in memory. Or is it possible for each thread to add rows to a single aggregate DataFrame without overwriting data?
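A minimal sketch of the in-memory approach I have in mind, assuming each worker returns its own DataFrame. The `process_repo` function here is a placeholder: in the real code it would shallow-clone the repo with GitPython and analyze each branch, but that detail is beside the point of the aggregation question.

```python
import concurrent.futures
import pandas as pd

def process_repo(repo_url):
    # Placeholder for the real work: shallow clone with GitPython,
    # iterate over branches, analyze contents. Returns one DataFrame
    # per repo so the aggregation pattern below is clear.
    return pd.DataFrame(
        {"repo": [repo_url], "branch": ["main"], "result": [1]}
    )

repo_urls = ["repo-a", "repo-b", "repo-c"]

# Each worker builds and returns its own DataFrame; no shared object
# is mutated from the threads, so no locking is needed.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(process_repo, repo_urls))

# Single aggregation step on the main thread, entirely in memory.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Is this pattern (collect per-thread DataFrames, then one pd.concat at the end) the right way to avoid the intermediate CSV files, or is there a safe way for the threads to append to `combined` directly?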
Any feedback on the pros and cons of various approaches would be appreciated.