I have some data in a .csv file that describes connections in a network.
   index  c     id_1  id_2
0      0  1        8    10
1      1  1        7    10
2      2  1        7    10
3      3  1  2189149    29
4      4  1       27    29
where c denotes a connection. The data has shape (3114045, 4) and takes up ~100 MB.
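For reference, I read the file roughly like this (the file name here is just a placeholder):

import numpy as np  # used for np.sum below
import pandas as pd

df = pd.read_csv("connections.csv")  # placeholder name; columns: index, c, id_1, id_2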
I would like to count the number of times id_1 is connected with id_2. I can accomplish this either by doing
adj_pivot = pd.pivot_table(data=df,
                           index="id_1",
                           columns="id_2",
                           values="c",
                           aggfunc=np.sum)
or alternatively - and much faster - I can do
adj_group = df.groupby(["id_1", "id_2"]).size().unstack(fill_value=0)
Either way, this gives me my desired output:
id_2      10   29
id_1
7        2.0    0
8        1.0    0
27         0  1.0
2189149    0  1.0
My problem is that if I do the above pivot/groupby with pandas, I will need ~5300 GB of RAM.
The sparse version of the full (3114045, 4) structure takes up 56 bytes according to sys.getsizeof(scipy.sparse.csr_matrix(df)). Trying out the above method with 100 000 rows and then converting the result to sparse, it looks like I can compress the size of the matrix by a factor of 10^-8.
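For reference, the size comparison looks roughly like this (just a sketch; the df.head sampling and the memory_usage call are my shorthand for how I checked, not necessarily exactly what I ran):

import sys
import scipy.sparse

# Sparse copy of the raw (3114045, 4) table:
print(sys.getsizeof(scipy.sparse.csr_matrix(df)))

# Pivot a 100 000 row sample, then compare dense vs. sparse size of the result:
sample = df.head(100_000)
adj_sample = sample.groupby(["id_1", "id_2"]).size().unstack(fill_value=0)
print(adj_sample.memory_usage(deep=True).sum())                   # dense bytes
print(sys.getsizeof(scipy.sparse.csr_matrix(adj_sample.values)))  # sparse bytes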
So, my question is: How would I replicate the above pivot+sum/groupby+fill on a sparse structure? If that cannot be done, is there a good tactic for doing this in batches?
I have looked at the answer here, but it seems a bit cryptic to me still.
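To make the question concrete, my rough guess at what a sparse solution might look like uses pd.factorize plus scipy.sparse.coo_matrix, but I have not verified that this is correct or that it is what the linked answer intends:

import numpy as np
import pandas as pd
from scipy import sparse

# Map each id to a consecutive row/column position.
rows, row_ids = pd.factorize(df["id_1"], sort=True)
cols, col_ids = pd.factorize(df["id_2"], sort=True)

# One entry per connection; duplicate (row, col) pairs should be summed
# when converting COO -> CSR, giving the connection counts.
adj_sparse = sparse.coo_matrix(
    (np.ones(len(df), dtype=np.int64), (rows, cols)),
    shape=(len(row_ids), len(col_ids)),
).tocsr()

If something like this is the right idea, I would still need row_ids / col_ids to keep track of which id each row/column of adj_sparse corresponds to.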