I have an 8 GB CSV file and my machine has 16 GB of RAM. When I try to read the whole file in one go, I get a memory error. So I tried to read it in using the chunksize parameter, like this:
import pandas as pd

dtypes = {"Column1": str, "Column2": str}

complete_data = pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                            dtype=dtypes, decimal=",", chunksize=1000000)
dfcompl = pd.concat(complete_data, ignore_index=True)
Again I get a memory error. Following this solution, I tried:
import pandas as pd

dtypes = {"Column1": str, "Column2": str}

with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        process(chunk)

dfcompl = pd.concat(chunk)
But I get the error NameError: name 'process' is not defined. Obviously I have to replace 'process' with something of my own, but I don't know with what. It looks like a simple task; I had hoped that simply adding chunksize would solve it, but I don't know how to get past this point.
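I assume process is supposed to be my own function that shrinks each chunk before collecting it, so the picture in my head is something like the sketch below, where reduce_chunk and the columns it keeps are just placeholders I made up:

import pandas as pd

# placeholder for whatever 'process' is supposed to do:
# keep only the columns I actually need from each chunk
def reduce_chunk(chunk):
    return chunk[["Column1", "Column2"]]

dtypes = {"Column1": str, "Column2": str}
pieces = []
with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for chunk in reader:
        pieces.append(reduce_chunk(chunk))   # collect the reduced chunks

dfcompl = pd.concat(pieces, ignore_index=True)

But even if that is the idea, I still end up concatenating everything into one dataframe at the end, so I am not sure it helps.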
So how can I read in a large csv file and append everything to one dataframe that I can then work with in pandas?
I do not want to use dask.
My problem is also that even if I process the file chunk by chunk and export the chunks, for example to pkl files, I still have the same problem at the end: when I try to concatenate these pkl files, I get a memory error again.
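For completeness, this is roughly the chunk-by-chunk export I mean; the part_ file names are just made up for illustration:

import glob
import pandas as pd

dtypes = {"Column1": str, "Column2": str}

# write each chunk to its own pickle file instead of keeping it in memory
with pd.read_csv(r'C:\folder\test.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=1000000) as reader:
    for i, chunk in enumerate(reader):
        chunk.to_pickle(rf'C:\folder\part_{i}.pkl')

# later: reading the parts back and concatenating them runs out of memory again
parts = [pd.read_pickle(f) for f in glob.glob(r'C:\folder\part_*.pkl')]
dfcompl = pd.concat(parts, ignore_index=True)   # -> MemoryError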