I've been reading different posts here on this topic, but I haven't really managed to find an answer.
I have a folder with 50 files (around 70 GB in total). I would like to open them and build a single DataFrame on which to perform computations.
Unfortunately, I run out of memory if I try to concatenate the files right after opening them.
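For context, the direct attempt looks roughly like this (open_file is my helper that reads one file into a DataFrame, lst holds the 50 paths; a minimal sketch, not my exact script):

import pandas as pd

# Read every file, then concatenate everything in one go;
# this is the step where I run out of memory.
dfs = [open_file(f) for f in lst]
full_df = pd.concat(dfs, ignore_index=True)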
I know I can work in chunks on smaller subsets, but these 50 files are already only a small portion of the entire dataset.
So I found an alternative that works: a for loop that concatenates one piece at a time and deletes it from the list at each iteration, but that obviously takes too much time. I also tried "pd.append", but that runs out of memory as well.
import pandas as pd

df_list = []
for file in lst:                         # lst holds the paths of the 50 files
    df_list.append(open_file(file))      # open_file reads one file into a DataFrame
    # print(file)

# Start from the last DataFrame and fold the others in one by one,
# deleting each piece right after it has been concatenated to free memory.
result = df_list.pop()
for i in range(len(df_list) - 1, -1, -1):
    print(i)
    result = pd.concat([result, df_list[i]], ignore_index=True)
    del df_list[i]
Although it works, I feel like I'm doing things twice. Furthermore, calling pd.concat inside a loop is a bad idea, since the running time blows up (Why does concatenation of DataFrames get exponentially slower?).
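What I mean by "doing things twice" is that, in principle, the whole thing could be a single pass, roughly like the sketch below, but then pd.concat is still called once per file:

import pandas as pd

# Single pass: read a file, fold it into the running result, let it go.
# This removes the separate list-building step, but pd.concat still runs
# once per file, so the repeated-copying problem remains.
result = open_file(lst[0])
for file in lst[1:]:
    piece = open_file(file)
    result = pd.concat([result, piece], ignore_index=True)
    del piece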
Does anyone have any suggestions?
Right now opening the files takes about 75 minutes and the concatenation another 105 minutes; I'm hoping to reduce that.