I have a big file (around 19 GB) that I want to load into memory in order to perform an aggregation over some columns.
The file looks like this:
id, col1, col2, col3
1 , 12 , 15 , 13
2 , 18 , 15 , 13
3 , 14 , 15 , 13
3 , 14 , 185 , 213
Notice that I am using the columns (id, col1) for the aggregation after loading the data into the data frame. Notice also that these keys may be repeated successively a few times, like:
3 , 14 , 15 , 13
3 , 14 , 185 , 213
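So, for the sample above, the aggregation should collapse the two (3, 14) rows into a single row with col2 and col3 summed:
id, col1, col2, col3
1 , 12 , 15 , 13
2 , 18 , 15 , 13
3 , 14 , 200 , 226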
For a small file, the following script does the job:
import pandas as pd
data = pd.read_csv("data_file", delimiter=",")
# sum col2 and col3 for every (id, col1) pair
data = data.reset_index(drop=True).groupby(["id", "col1"], as_index=False).sum()
However, for a large file, I need to use chunksize when reading the CSV file to limit the number of rows loaded into memory at once:
import pandas as pd
chunks = pd.read_csv("data_file", delimiter=",", chunksize=1000000)
# aggregate each chunk on its own, then stack the partial results
data = pd.concat(chunk.groupby(["id", "col1"], as_index=False).sum()
                 for chunk in chunks)
In the latter case, there is a problem when rows with the same (id, col1) are split across different chunks: each chunk produces its own partial sum, so the same key can appear more than once in the result. How can I deal with that?
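To make the problem concrete, here is a minimal sketch that reproduces it on the small sample. The tiny chunksize is chosen only for illustration, and I assume a clean comma-separated copy of the sample with no stray spaces in the header:

import pandas as pd
from io import StringIO

sample = """id,col1,col2,col3
1,12,15,13
2,18,15,13
3,14,15,13
3,14,185,213
"""

# chunksize=3 deliberately splits the two (3, 14) rows across two chunks
chunks = pd.read_csv(StringIO(sample), chunksize=3)
partial = pd.concat(chunk.groupby(["id", "col1"], as_index=False).sum()
                    for chunk in chunks)
print(partial)  # the key (3, 14) appears twice, once per chunk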
EDIT
As pointed out by @EdChum, there is a potential workaround: append each chunk's groupby result to a new CSV, read that back in, and perform the aggregation again, repeating until the size of the data frame no longer changes.
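A minimal sketch of that workaround, assuming the columns are literally named id and col1; the helpers reduce_once and row_count, the reduced_%d.csv file names, and the cap of 100 passes are placeholders of my own:

import pandas as pd

def reduce_once(in_path, out_path, chunksize=1000000):
    # one pass: aggregate each chunk and append its partial sums to out_path
    first = True
    for chunk in pd.read_csv(in_path, chunksize=chunksize):
        agg = chunk.groupby(["id", "col1"], as_index=False).sum()
        agg.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
        first = False

def row_count(path, chunksize=1000000):
    # count rows without loading the whole file at once
    return sum(len(chunk) for chunk in pd.read_csv(path, chunksize=chunksize))

# repeat the reduction until a pass stops shrinking the data
src = "data_file"
for i in range(100):                      # arbitrary safety cap on the number of passes
    dst = "reduced_%d.csv" % i
    reduce_once(src, dst)
    if row_count(dst) == row_count(src):  # nothing merged in this pass: done
        break
    src = dst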
This workaround, however, has a worst case that is not handled: when all the chunks (or more of them than memory can handle) end with the same problematic (id, col1) keys, the re-aggregation does not shrink the data enough and the process still fails with a MemoryError.