
I have some data in a .csv file that describes connections in a network.

        index  c     id_1   id_2
0           0  1        8     10
1           1  1        7     10
2           2  1        7     10
3           3  1  2189149     29
4           4  1       27     29

where c denotes a connection. This data has shape (3114045, 4) and takes up ~100 MB.

I would like to count the number of times id_1 is connected with id_2. I can accomplish this either by doing

import numpy as np
import pandas as pd

adj_pivot = pd.pivot_table(data=df,
                           index="id_1",
                           columns="id_2",
                           values="c",
                           aggfunc=np.sum)

or alternatively - and much faster - I can do

adj_group = df.groupby(["id_1", "id_2"]).size().unstack(fill_value=0)

Either way, this gives me my desired output:

id_2     10   29
id_1            
7        2.0  0
8        1.0  0
27       0    1.0
2189149  0    1.0

My problem is that if I do the above pivot/groupby with pandas, I will need ~5300 GB of RAM.

The sparse version of the full (3114045, 4) structure takes up 56 bytes according to sys.getsizeof(scipy.sparse.csr_matrix(df)). Trying out the above method with 100,000 rows and then converting the result to sparse, it looks like I can shrink the matrix to roughly 10^-8 of its original size.
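As an aside, sys.getsizeof presumably only counts the Python wrapper object; a rough sketch of how one might estimate the real footprint by summing the underlying CSR buffers (assuming df is fully numeric and calling the matrix m) would be:

import scipy.sparse

# sys.getsizeof only reports the wrapper; the payload lives in the
# data/indices/indptr arrays, so sum their buffers instead.
m = scipy.sparse.csr_matrix(df.values)
footprint_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes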


So, my question is: How would I replicate the above pivot+sum/groupby+fill on a sparse structure? If that cannot be done, is there a good tactic for doing this in batches?

I have looked at the answer here, but it still seems a bit cryptic to me.


1 Answer


This should work:

import scipy.sparse

# Count each (id_1, id_2) pair, then use the ids as coordinates
# and the counts as values of a sparse matrix.
grouped = df.groupby(["id_1", "id_2"]).size().reset_index()
values = grouped.values.T  # rows: id_1, id_2, count
scipy.sparse.csr_matrix((values[2], (values[0], values[1])))

<2189150x30 sparse matrix of type '<class 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>
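Note that the raw id values are used directly as coordinates, which is why the matrix has 2189150 rows. If you would rather have a compact matrix indexed by the distinct ids, a possible variant (a sketch, assuming pd.factorize codes are an acceptable index) maps the ids to contiguous codes first:

import numpy as np
import pandas as pd
import scipy.sparse

# Map raw ids to contiguous integer codes so the matrix dimensions match
# the number of distinct ids rather than their magnitudes.
rows, row_ids = pd.factorize(df["id_1"])
cols, col_ids = pd.factorize(df["id_2"])

# csr_matrix sums duplicate (row, col) entries, which reproduces the
# groupby(...).size() counts without building a dense table.
counts = scipy.sparse.csr_matrix(
    (np.ones(len(df), dtype=np.int64), (rows, cols)),
    shape=(len(row_ids), len(col_ids)),
)

counts[i, j] is then the number of rows with id_1 == row_ids[i] and id_2 == col_ids[j].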