
I have to read a lot of .csv files in pandas and concatenate them. Total size is about 10 GB of data and concatenating all together gives me memory error.

I'm not sure I can read them file by file or chunk by chunk, because eventually I have to apply SMOTE for balancing the final dataframe, so I need the complete dataset.

How can I do this?
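For context, the failing pattern is presumably something like the sketch below (the directory of two tiny demo files stands in for the real ~10 GB of CSVs); `pd.concat` materialises the entire result at once, which is where the memory error occurs:

```python
import glob
import os
import tempfile

import pandas as pd

# Stand-in for the real directory of CSVs (hypothetical demo data).
tmp = tempfile.mkdtemp()
pd.DataFrame({"a": [1, 2]}).to_csv(os.path.join(tmp, "part1.csv"), index=False)
pd.DataFrame({"a": [3, 4]}).to_csv(os.path.join(tmp, "part2.csv"), index=False)

# The usual pattern: read every file, then concatenate once.
# pd.concat builds the full combined DataFrame in memory, so with
# ~10 GB of real files this is exactly the step that raises MemoryError.
files = sorted(glob.glob(os.path.join(tmp, "*.csv")))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(len(df))  # 4 rows from the two demo files
```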

Luigi Montaleone
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. – Community May 26 '22 at 15:08
  • Do you have enough RAM on your machine? – Sergey Sakharovskiy May 26 '22 at 15:32
  • No, only 16 GB. Since every file has a lot of columns, I cannot load everything all at once – Luigi Montaleone May 27 '22 at 08:47
  • @LuigiMontaleone - if you don't have enough RAM to load everything into memory you're going to need to give more details on what you're trying to do. I'm assuming you're doing some form of group-apply-combine? If so you could run through the files, determine what your groups are, then pull the groups one by one from the files – user1543042 May 29 '22 at 02:31
  • Alternatively you could use some big data package that doesn't need to hold everything in RAM such as pyspark – user1543042 May 29 '22 at 02:33

1 Answer


I don't know what SMOTE is, but does this answer your question?

Import multiple csv files into pandas and concatenate into one DataFrame

Or this:

https://pandas.pydata.org/docs/reference/api/pandas.concat.html
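Since SMOTE needs the full dataset in memory, one workaround (my own sketch, not part of the linked answers) is to downcast each frame's numeric columns before concatenating: `float64` → `float32` and `int64` → the smallest integer type roughly halves the footprint of numeric data. The helper name `downcast` and the demo frame are hypothetical:

```python
import pandas as pd

def downcast(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink numeric columns: float64 -> float32, int64 -> smallest int type."""
    for col in df.select_dtypes(include="float").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    for col in df.select_dtypes(include="integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    return df

# Demo frame standing in for one loaded CSV.
demo = pd.DataFrame({"x": [0.5, 1.5, 2.5], "n": [10, 20, 30]})
before = demo.memory_usage(deep=True).sum()
demo = downcast(demo)
after = demo.memory_usage(deep=True).sum()
print(demo.dtypes["x"], demo.dtypes["n"], before > after)
```

Applying this per file as each CSV is read, then concatenating the already-downcast frames, keeps the peak memory well below concatenating full `float64` frames.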

ASH