
I have an 8 GB CSV file. My RAM size is 16 GB. When I try to read it in, I get a memory error. So I tried to read it in chunks using the chunksize parameter, like this:

import pandas as pd

dtypes = {"Column1": str, "Column2": str}

complete_data=pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=1000000)
dfcompl = pd.concat(complete_data, ignore_index=True)

Again, I get a memory error. Following this solution, I tried:

import pandas as pd

dtypes = {"Column1": str, "Column2": str}

with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        process(chunk)
        
dfcompl = pd.concat(chunk)

But I get the error NameError: name 'process' is not defined. Obviously I have to replace 'process' with something, but I don't know what. It looks like a simple task; I had hoped that simply adding chunksize would solve it, but I don't know how to fix this.

So how can I read in a large CSV file and append the chunks into one dataframe that I can work with in pandas?

I do not want to use dask.

My problem is also that even if I process the data chunk by chunk and export, for example, to pkl files, I still have a problem at the end: if I try to concatenate these pkl files, I get a memory error again.

PSt
    It's very likely that a CSV file that's 8GB is going to need a lot more than that once it's loaded into a dataframe. You may simply be constrained by the small amount of RAM that you have – DarkKnight Jan 05 '23 at 14:46

2 Answers


There is no real point in reading a CSV file in chunks if you want to collect all the chunks into a single dataframe afterwards - it will require ~8 GB of memory anyway. The whole idea behind chunking is to process your data in parts, so you never need the full amount of memory (imagine if your CSV were not 8 GB but 80!). That is exactly what happens in your second snippet:

import pandas as pd

dtypes = {"Column1": str, "Column2": str}

with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        process(chunk)
        
dfcompl = pd.concat(chunk)

where process is some function that performs computations on your data. You have to implement it yourself. You get a chunk of data, process it, and throw it away immediately.

Try to look at your task from this perspective: do you really need all of the data at once, or could you do something incrementally, even line by line if you take it to the extreme?
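For instance, here is a minimal sketch (not part of the original answer) of what such an incremental process could look like if all you needed were aggregates; the row count and per-column sums chosen here are just illustrative:

import pandas as pd

dtypes = {"Column1": str, "Column2": str}

# small running results that fit in memory no matter how large the CSV is
row_count = 0
partial_sums = None

with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        row_count += len(chunk)
        # aggregate the numeric columns of this chunk, then let the chunk be discarded
        chunk_sums = chunk.select_dtypes("number").sum()
        partial_sums = chunk_sums if partial_sums is None else partial_sums.add(chunk_sums, fill_value=0)

print(row_count)
print(partial_sums)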

UPDATE

So if you want to keep only the rows where Column3 < 900, you could do the following:

output = None

with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        # keep only the rows of this chunk that satisfy the condition
        filtered = chunk[chunk['Column3'] < 900]

        if output is None:
            output = filtered
            continue

        output = pd.concat([output, filtered])

output.to_csv(some_output_path)  # some_output_path is a placeholder for your output file

It's not optimal and a bit ugly, but it illustrates the idea.
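If the end goal is just a smaller CSV with the matching rows, you can also avoid holding even the filtered result in memory by appending each chunk's matches straight to the output file. A rough sketch along the same lines (the output path is a placeholder):

import pandas as pd

dtypes = {"Column1": str, "Column2": str}
out_path = r'C:\folder\filtered.csv'  # placeholder output path

first_chunk = True
with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        filtered = chunk[chunk['Column3'] < 900]
        # append each filtered chunk to the output file; write the header only once
        filtered.to_csv(out_path, sep=";", decimal=",", index=False,
                        mode="w" if first_chunk else "a", header=first_chunk)
        first_chunk = False

This way the complete filtered result never has to fit in memory at once.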

CaptainTrunky
  • Ok, +1, that clarifies it. Could you give an example of how I would process the chunks? Let's say I have the big CSV file and read it in using chunks. Then I want to keep only the records where the value of Column3 (integer) is less than 900, and at the end export all of these into one CSV file. That file will of course be smaller than the original large CSV file, so I need to extract all the records with Column3 < 900 into a separate CSV file. – PSt Jan 05 '23 at 13:03
  • See the update above – CaptainTrunky Jan 05 '23 at 13:08
  • I am not quite getting this with the def function and the output. Is it possible to do it without wrapping it in a function? And where is the to_csv to export it to a CSV? – PSt Jan 05 '23 at 13:13

I think 'process' in the original post is meant to be your own function.

Try this:

import pandas as pd

dtypes = {"Column1": str, "Column2": str}

dfcompl = None
with pd.read_csv(r'C:\folder\test.csv',
                 sep=";", encoding="utf-8",
                 dtype=dtypes,
                 decimal=",",
                 chunksize=1000000) as reader:
    for chunk in reader:
        if dfcompl is None:
            dfcompl = chunk
        else:
            dfcompl = pd.concat([dfcompl, chunk], axis=0)

dfcompl

dfcompl is the final dataframe you need. I initialize it with None, then use pd.concat([dfcompl, chunk], axis=0) to append each chunk to what has been accumulated so far. Note that axis=0 concatenates along rows, i.e. appends at the end.
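As a side note, calling pd.concat inside the loop copies the accumulated data on every iteration. A slightly faster variant of the same idea (it still needs enough RAM to hold the complete dataframe, which is the underlying constraint here) is to collect the chunks in a list and concatenate once at the end:

import pandas as pd

dtypes = {"Column1": str, "Column2": str}

chunks = []
with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        chunks.append(chunk)  # just keep a reference; no copying yet

dfcompl = pd.concat(chunks, ignore_index=True)  # single concatenation at the end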

I hope this helps.