11

I have a big file (19GB or so) that I want to load in memory to perform an aggregation over some columns.

the file looks like this:

id, col1, col2, col3
1 ,  12 , 15 , 13 
2 ,  18 , 15 , 13 
3 ,  14 , 15 , 13 
3 ,  14 , 185 , 213 

Notice that I am using the columns (id, col1) for the aggregation after loading the data into the data frame. Notice also that these keys might be repeated successively a few times, like:

3 ,  14 , 15 , 13 
3 ,  14 , 185 , 213 

For a small file, the following script can do the job:

import pandas as pd
data = pd.read_csv("data_file", delimiter=",")
data = data.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()

However, for a large file, I need to use chunksize when reading the csv file to limit the number of rows loaded into memory:

import pandas as pd
chunks = pd.read_csv("data_file", delimiter=",", chunksize=1000000)
# aggregate each chunk separately, then stack the per-chunk results
data = pd.concat(chunk.groupby(["id", "col1"], as_index=False).sum()
                 for chunk in chunks)

In the latter case, there will be a problem if rows sharing the same (id, col1) end up in different chunks. How can I deal with that?
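To make the issue concrete, here is a tiny reproduction (with the chunk size shrunk to 1 purely for illustration), where the per-chunk sums leave the key (3, 14) split into two rows:

import io
import pandas as pd

# tiny reproduction: the key (3, 14) straddles a chunk boundary,
# so per-chunk aggregation leaves it split into two rows
csv = "id,col1,col2,col3\n3,14,15,13\n3,14,185,213\n"
chunks = pd.read_csv(io.StringIO(csv), chunksize=1)
print(pd.concat(chunk.groupby(["id", "col1"], as_index=False).sum()
                for chunk in chunks))
#   id  col1  col2  col3
#0   3    14    15    13
#0   3    14   185   213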

EDIT

As pointed out by @EdChum, there is a potential workaround: append the groupby results to a new csv, read that back in, and perform the aggregation again, repeating until the df size doesn't change.

This, however, has a worst-case scenario that is not handled:

when all chunks (or more chunks than memory can handle) have the same problematic (id, col1) keys at the end. This will cause the system to raise a MemoryError.
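For reference, a minimal sketch of that workaround in plain pandas (the scratch file names and the chunk size are illustrative, not from the question): each pass aggregates chunk by chunk into a scratch csv, and the passes repeat until the number of aggregated rows stops shrinking.

import pandas as pd

# sketch of the append-and-re-read workaround; pass_a.csv / pass_b.csv
# are hypothetical scratch files that alternate between passes
src, dst = "data_file", "pass_a.csv"
prev_rows = None
while True:
    rows, header = 0, True
    with open(dst, "w") as out:
        for chunk in pd.read_csv(src, delimiter=",", chunksize=1000000):
            agg = chunk.groupby(["id", "col1"], as_index=False).sum()
            agg.to_csv(out, header=header, index=False)
            header = False
            rows += len(agg)
    if prev_rows is not None and rows >= prev_rows:
        break  # no further reduction; dst now holds the final aggregation
    prev_rows = rows
    src, dst = dst, ("pass_b.csv" if dst == "pass_a.csv" else "pass_a.csv")

Note that this sketch only mitigates, not removes, the worst case described above: if duplicate keys never end up in the same chunk, the row count stops shrinking while the data is still not fully aggregated.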

Mohamed Ali JAMAOUI
  • Could you not just append the groupby results to a new csv and read that back in and perform the aggregation again until the df size doesn't change? – EdChum Nov 05 '15 at 11:09
  • @EdChum, thanks. That might be a good idea. I will consider it if there are no other "native" pandas options. – Mohamed Ali JAMAOUI Nov 05 '15 at 11:11
  • @EdChum, there is a worst-case scenario, however: when all files (or more files than memory can handle) have the same problematic (id, col1) at the end. This will cause the system to return a MemoryError (let me know if I need to further explain) – Mohamed Ali JAMAOUI Nov 05 '15 at 11:12

2 Answers

11

dask solution

Dask.dataframe can almost do this without modification

$ cat so.csv
id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213

$ pip install dask[dataframe]
$ ipython

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_csv('so.csv', sep=',')

In [3]: df.head()
Out[3]: 
   id  col1  col2  col3
0   1    13    15    14
1   1    13    15    14
2   1    12    15    13
3   2    18    15    13
4   2    18    15    13

In [4]: df.groupby(['id', 'col1']).sum().compute()
Out[4]: 
         col2  col3
id col1            
1  12      15    13
   13      30    28
2  18      90    78
3  14     215   239

Groupby's as_index=False option hasn't been implemented in dask yet, though. We can work around this with assign.

In [5]: df.assign(id_2=df.id, col1_2=df.col1).groupby(['id_2', 'col1_2']).sum().compute()
Out[5]: 
             id  col1  col2  col3
id_2 col1_2                      
1    12       1    12    15    13
     13       2    26    30    28
2    18      12   108    90    78
3    14       9    42   215   239

How this works

We'll pull out chunks and do groupbys just like in your first example. Once we're done grouping and summing each of the chunks, we'll gather all of the intermediate results together and do another, slightly different groupby.sum. This assumes that the intermediate results fit in memory.
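For readers who cannot install dask (see the comments below), the same two-level strategy can be imitated in plain pandas. This is only an illustration of the idea, with a chunk size picked to fit the small sample file above, and it still assumes the concatenated per-chunk sums fit in memory.

import pandas as pd

# plain-pandas sketch of the same split-apply-combine idea
# (chunk size is arbitrary; the per-chunk sums must fit in memory)
partials = [
    chunk.groupby(['id', 'col1']).sum()
    for chunk in pd.read_csv('so.csv', sep=',', chunksize=4)
]
# second, slightly different pass: sum the per-chunk sums per key
result = pd.concat(partials).groupby(level=['id', 'col1']).sum()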

Parallelism

As a pleasant side effect, this will also operate in parallel.

MRocklin
  • This is a good solution. Unfortunately, I don't have the permission to install dask on the deployment server and I can only use native pandas. – Mohamed Ali JAMAOUI Nov 05 '15 at 15:55
  • The recommended way to install pandas is to use conda, see docs here: http://pandas.pydata.org/pandas-docs/stable/install.html; this installs in a user environment, allows version control, is a private virtual environment and incidentally allows you to install dask (which is pip installable as well) – Jeff Nov 05 '15 at 16:09
  • dask doesn't even support loading `gzip` files and breaking them into parts for parallel purpose? – avocado Dec 11 '17 at 15:58
  • The GZip format doesn't support random access, so no. – MRocklin Dec 12 '17 at 01:41
1

First, you can collect the list of unique keys by reading the csv with usecols=['id', 'col1']. Then read the csv in chunks, concatenate the subsets of each chunk that match one id, and groupby. The code below explains it better.

If the column col1 works better as the key, change it to constants = df['col1'].unique().tolist(). It depends on your data.

Or you can read only one column, e.g. df = pd.read_csv(io.StringIO(temp), sep=",", usecols=['id']); again, it depends on your data.

import pandas as pd
import numpy as np
import io

#test data
temp=u"""id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213"""
df = pd.read_csv(io.StringIO(temp), sep=",", usecols=['id', 'col1'])
#drop duplicates; from the output you can choose the constants
df = df.drop_duplicates()
print(df)
#   id  col1
#0   1    13
#2   1    12
#3   2    18
#9   3    14

#for example list of constants
constants = [1,2,3]
#or column id to list of unique values
constants = df['id'].unique().tolist()
print(constants)
#[1, 2, 3]

for i in constants:
    iter_csv = pd.read_csv(io.StringIO(temp), delimiter=",", chunksize=10)
    #concat subset with rows id == constant
    df = pd.concat([chunk[chunk['id'] == i] for chunk in iter_csv])
    #your groupby function
    data = df.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()
    print(data.to_csv(index=False))

    #id,col1,col2,col3
    #1,12,15,13
    #1,13,30,28
    #
    #id,col1,col2,col3
    #2,18,90,78
    #
    #id,col1,col2,col3
    #3,14,215,239
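A possible variation (not part of the answer above, and reusing temp and constants from the code above): instead of re-reading the whole csv once for every constant, you could filter each chunk against a batch of keys with isin and aggregate the kept rows, trading many passes over the file for a larger in-memory subset per batch.

#hypothetical variation, not from the original answer: handle the keys
#in batches with isin(), one full read per batch instead of per key
batch_size = 2  #illustrative
for start in range(0, len(constants), batch_size):
    batch = constants[start:start + batch_size]
    iter_csv = pd.read_csv(io.StringIO(temp), delimiter=",", chunksize=10)
    df = pd.concat(chunk[chunk['id'].isin(batch)] for chunk in iter_csv)
    data = df.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()
    print(data.to_csv(index=False))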
jezrael
  • Thanks for your proposition, but I don't clearly get the idea behind it. I don't want to drop any of the columns, as the results of the sum aggregation would be altered if I did so. – Mohamed Ali JAMAOUI Nov 05 '15 at 12:40
  • You drop some columns only to choose the constants. And use usecols because it is a large df. – jezrael Nov 05 '15 at 12:43