
I want to read a CSV file. It has 5 million rows and 13 columns. The file size is 25 GB, and the server has 24 GB of RAM.

import pandas as pd

df_list = []
chunksize = 100000

# read the file in chunks, but keep every chunk in the list
for chunk in pd.read_csv(path, chunksize=chunksize):
    df_list.append(chunk)

# this keeps every chunk, so the full file ends up in memory anyway
X = pd.concat(df_list)

After running for a moment, it stops with an error.

I want to stop (or do something else) if memory/RAM usage reaches 20 GB.
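Something like the sketch below is roughly what I have in mind, but it is untested and psutil is only my guess for how to check RAM usage:

import psutil          # assumption: psutil is available to check RAM usage
import pandas as pd

df_list = []
chunksize = 100000

for chunk in pd.read_csv(path, chunksize=chunksize):
    used_gb = psutil.virtual_memory().used / 1024 ** 3
    if used_gb > 20:   # stop once about 20 GB of RAM is in use
        print("Stopping early: memory usage is over 20 GB")
        break
    df_list.append(chunk)

X = pd.concat(df_list)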

Tawan
  • How many columns? Not helping, I know; I'm just curious since you said you only have 5 rows. – Justin Ezequiel Jul 10 '21 at 10:18
  • Split it by rows. There is even a literal `split` command for that. Work with each part, then combine the results. – Alex Yu Jul 10 '21 at 10:22
  • @JustinEzequiel Oh, I wrote it wrong. There are 5 million rows and 13 columns. – Tawan Jul 10 '21 at 10:28
  • @AlexYu How do I use the split command? Can you give an example? – Tawan Jul 10 '21 at 10:29
  • @Tawan Look: https://stackoverflow.com/a/2016918/1168212 – Alex Yu Jul 10 '21 at 10:31
  • Do you really need to put everything in a list? Could you just process one chunk at a time and discard the previous chunk? – Barmar Jul 10 '21 at 10:33
  • @Barmar Either way is OK, but I don't know how to do that. – Tawan Jul 10 '21 at 10:38
  • How are you trying to process the data? For example, if you were just summing the first column, you would not need to read everything in at the same time (a chunk-by-chunk sketch of that idea is shown after these comments). Could you show a worked example for, say, 5 rows in your question, i.e. show the desired output? – Martin Evans Jul 10 '21 at 15:40
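Following the suggestions in the comments, here is a minimal sketch of chunk-by-chunk processing, assuming the goal is a simple aggregate such as summing one column (the column name col_0 is only a placeholder):

import pandas as pd

total = 0.0
row_count = 0

# process one chunk at a time and keep only running totals,
# so at most one chunk is ever held in memory
for chunk in pd.read_csv(path, chunksize=100000):
    total += chunk["col_0"].sum()   # "col_0" is a placeholder column name
    row_count += len(chunk)

print("sum of col_0:", total, "rows processed:", row_count)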

1 Answer


Dask can be used for large files; try this, it may be helpful in your case:

import time
from dask import dataframe as dd

start = time.time()
df = dd.read_csv(path)
end = time.time()
print("Read csv with dask:", end - start, "sec")
Maaz Irfan