
I have some data in a .csv file that describes connections in a network.

        index  c     id_1   id_2
0           0  1        8     10
1           1  1        7     10
2           2  1        7     10
3           3  1  2189149     29
4           4  1       27     29

where c denotes a connection. This data has shape (3114045, 4) and takes up ~100 MB.

I would like to count the number of times id_1 is connected with id_2. I can accomplish this either by doing

import numpy as np
import pandas as pd

adj_pivot = pd.pivot_table(data=df,
                           index="id_1",
                           columns="id_2",
                           values="c",
                           aggfunc=np.sum)

or alternatively - and much faster - I can do

adj_group = df.groupby(["id_1", "id_2"]).size().unstack(fill_value=0)

Either way, this gives me my desired output:

id_2     10   29
id_1            
7        2.0  0
8        1.0  0
27       0    1.0
2189149  0    1.0

My problem is that if I do the above pivot/groupby with pandas, I will need ~5300 GB of RAM.

The sparse version of the full (3114045, 4) structure takes up 56 bytes according to sys.getsizeof(scipy.sparse.csr_matrix(df)). Trying out the above method with 100,000 rows and then converting the result to sparse, it looks like I can shrink the matrix to roughly 10^-8 of its original size.
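As an aside, sys.getsizeof presumably only counts the Python wrapper object; a rough sketch of how one might estimate the real footprint by summing the underlying CSR buffers (assuming df is fully numeric and calling the matrix m) would be:

import scipy.sparse

# sys.getsizeof only reports the wrapper; the payload lives in the
# data/indices/indptr arrays, so sum their buffers instead.
m = scipy.sparse.csr_matrix(df.values)
footprint_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes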


So, my question is: How would I replicate the above pivot+sum/groupby+fill on a sparse structure? If that cannot be done, is there a good tactic for doing this in batches?

I have looked at the answer here, but it still seems a bit cryptic to me.


1 Answer


This should work:

import scipy.sparse

# Count each (id_1, id_2) pair, then use the ids as coordinates
# and the counts as values of a sparse matrix.
grouped = df.groupby(["id_1", "id_2"]).size().reset_index()
values = grouped.values.T  # rows: id_1, id_2, count
scipy.sparse.csr_matrix((values[2], (values[0], values[1])))

<2189150x30 sparse matrix of type '<class 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>
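Note that the raw id values are used directly as coordinates, which is why the matrix has 2189150 rows. If you would rather have a compact matrix indexed by the distinct ids, a possible variant (a sketch, assuming pd.factorize codes are an acceptable index) maps the ids to contiguous codes first:

import numpy as np
import pandas as pd
import scipy.sparse

# Map raw ids to contiguous integer codes so the matrix dimensions match
# the number of distinct ids rather than their magnitudes.
rows, row_ids = pd.factorize(df["id_1"])
cols, col_ids = pd.factorize(df["id_2"])

# csr_matrix sums duplicate (row, col) entries, which reproduces the
# groupby(...).size() counts without building a dense table.
counts = scipy.sparse.csr_matrix(
    (np.ones(len(df), dtype=np.int64), (rows, cols)),
    shape=(len(row_ids), len(col_ids)),
)

counts[i, j] is then the number of rows with id_1 == row_ids[i] and id_2 == col_ids[j].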