
I want to read a CSV file. It has 5 million rows and 13 columns. The file size is 25 GB, and the server has 24 GB of RAM.

import pandas as pd

df_list = []
chunksize = 100000

# read the file in chunks, but keep every chunk in the list
for chunk in pd.read_csv(path, chunksize=chunksize):
    df_list.append(chunk)

# this keeps every chunk, so the full file ends up in memory anyway
X = pd.concat(df_list)

After running for a moment, it stops with an error.

I want to stop (or do something else) if memory/RAM usage reaches 20 GB.
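Something like the sketch below is roughly what I have in mind, but it is untested and psutil is only my guess for how to check RAM usage:

import psutil          # assumption: psutil is available to check RAM usage
import pandas as pd

df_list = []
chunksize = 100000

for chunk in pd.read_csv(path, chunksize=chunksize):
    used_gb = psutil.virtual_memory().used / 1024 ** 3
    if used_gb > 20:   # stop once about 20 GB of RAM is in use
        print("Stopping early: memory usage is over 20 GB")
        break
    df_list.append(chunk)

X = pd.concat(df_list)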

Tawan
  • How many columns? Not helping, I know; I'm just curious since you said you only have 5 rows. – Justin Ezequiel Jul 10 '21 at 10:18
  • Split it by rows. There is even a literal `split` command for that. Work with each part, then combine the results. – Alex Yu Jul 10 '21 at 10:22
  • @JustinEzequiel Oh, I wrote it wrong. There are 5 million rows and 13 columns. – Tawan Jul 10 '21 at 10:28
  • @AlexYu How do I use the split command? Can you give an example? – Tawan Jul 10 '21 at 10:29
  • @Tawan Look: https://stackoverflow.com/a/2016918/1168212 – Alex Yu Jul 10 '21 at 10:31
  • Do you really need to put everything in a list? Could you just process one chunk at a time and discard the previous chunk? – Barmar Jul 10 '21 at 10:33
  • @Barmar Either way is OK, but I don't know how to do that. – Tawan Jul 10 '21 at 10:38
  • How are you trying to process the data? For example, if you were just summing the first column, you would not need to read everything in at the same time (a chunk-by-chunk sketch of that idea is shown after these comments). Could you show a worked example for, say, 5 rows in your question, i.e. show the desired output? – Martin Evans Jul 10 '21 at 15:40
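Following the suggestions in the comments, here is a minimal sketch of chunk-by-chunk processing, assuming the goal is a simple aggregate such as summing one column (the column name col_0 is only a placeholder):

import pandas as pd

total = 0.0
row_count = 0

# process one chunk at a time and keep only running totals,
# so at most one chunk is ever held in memory
for chunk in pd.read_csv(path, chunksize=100000):
    total += chunk["col_0"].sum()   # "col_0" is a placeholder column name
    row_count += len(chunk)

print("sum of col_0:", total, "rows processed:", row_count)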

1 Answer


Dask can be used for large files; try this, it may be helpful in your case:

import time
from dask import dataframe as dd

start = time.time()
df = dd.read_csv(path)
end = time.time()
print("Read csv with dask:", end - start, "sec")
Maaz Irfan