
I've been reading different posts here on this topic, but I didn't really manage to find an answer.

I have a folder with 50 files (total size around 70 GB). I would like to open them and create a single DataFrame on which to perform computations.

Unfortunately, I run out of memory if I try to concatenate the files immediately after opening them.
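For reference, this is roughly the straightforward version I mean (open_file is my own helper that returns one DataFrame per file):

import pandas as pd

# load every file, then concatenate everything in a single call
# (this is the step where I run out of memory)
dfs = [open_file(file) for file in lst]
result = pd.concat(dfs, ignore_index=True)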

I know I can operate on chunks and work on smaller subsets, but these 50 files are already a small portion of the entire dataset.
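(By chunks I mean something along these lines; the read_csv call and the per-chunk aggregation are just placeholders, since my real files go through open_file.)

# sketch of chunked processing: never hold more than one chunk in memory
partial_results = []
for file in lst:
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        # placeholder computation on the chunk instead of keeping the raw data
        partial_results.append(chunk.sum(numeric_only=True))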

Thus, I found an alternative that works (using for loops and deleting list entries at each iteration), but it obviously takes too much time. I also made an attempt with DataFrame.append, but in that case I run out of memory as well.

df_list = []
for file in lst:
    # open_file returns the DataFrame for one file
    df_list.append(open_file(file))
    # print(file)

# use the last DataFrame as the starting point and remove it from the list
result = df_list.pop()

# merge the remaining DataFrames one at a time, deleting each one
# from the list right after it has been merged to free memory
for i in range(len(df_list) - 1, -1, -1):
    print(i)
    result = pd.concat([result, df_list[i]], ignore_index=True)
    del df_list[i]

Although this works, I feel like I'm doing things twice. Furthermore, putting pd.concat in a loop is a very bad idea, since the time increases exponentially (see Why does concatenation of DataFrames get exponentially slower?).

Does anyone have any suggestions?

Right now, opening the files takes about 75 minutes and the concatenation another 105 minutes. I hope to reduce this time.
