I have the following df:
          CUI1      CUI2  tot
0     C0000699  C3894683    2
1     C0000699  C0101725    1
2     C0000699  C1882413    3
...        ...       ...  ...
9995  C0000715  C0026382   56
9996  C0000715  C0010334  101
...
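For anyone who wants to reproduce this, here is a minimal sketch that rebuilds only the rows visible in the printout above. The elided rows are not filled in, and the index values are simply copied from the printout.

```python
import pandas as pd

# Only the five rows visible in the printout; elided rows are left out.
df = pd.DataFrame(
    {
        "CUI1": ["C0000699", "C0000699", "C0000699", "C0000715", "C0000715"],
        "CUI2": ["C3894683", "C0101725", "C1882413", "C0026382", "C0010334"],
        "tot": [2, 1, 3, 56, 101],
    },
    index=[0, 1, 2, 9995, 9996],
)
print(df)
```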
which I need to transform into a co-occurrence matrix:
          C0000699  C3894683  C0101725  C1882413  ...  C0026382
C0000699         0         2         1         3  ...         m
...
C3894683         2         0         n         p  ...         q
...
etc.
The df is extremely large (~11 million rows), and I have tried looping through it to set columns (using unstack and get_level, etc.), but it is taking an inordinate amount of time and memory. Also note that pairs with a tot of 0 are not given in the initial stacked df.
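Since only non-zero pairs are listed, the stacked df is effectively already in coordinate (row, column, value) form. One direction that might exploit this is sketched below on a toy frame. It is not tested at the full 11 million rows, and it assumes that the matrix is symmetric, that each pair appears once and in one direction only, and that scipy is available. The names `upper`, `cooc` and `dense` are made up for illustration.

```python
import pandas as pd
from scipy import sparse

# Toy stand-in for the real frame (first three rows from the printout).
df = pd.DataFrame(
    {
        "CUI1": ["C0000699", "C0000699", "C0000699"],
        "CUI2": ["C3894683", "C0101725", "C1882413"],
        "tot": [2, 1, 3],
    }
)

# Map every distinct code from either column to one shared integer position.
codes, labels = pd.factorize(
    pd.concat([df["CUI1"], df["CUI2"]], ignore_index=True)
)
n = len(df)
rows, cols = codes[:n], codes[n:]

# Only the listed pairs are stored; absent pairs are implicit zeros.
upper = sparse.coo_matrix(
    (df["tot"].to_numpy(), (rows, cols)),
    shape=(len(labels), len(labels)),
)

# Assumption: symmetric matrix, each pair listed once (otherwise counts double).
cooc = (upper + upper.T).tocsr()

# Dense labelled view for the toy example only; avoid this at full size.
dense = pd.DataFrame(cooc.toarray(), index=labels, columns=labels)
print(dense)
```

The sparse matrix stores only the listed pairs, so its memory use grows with the number of rows in the df rather than with the square of the number of distinct codes.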
Is there a vectorized solution that would speed this up and lower the memory footprint? The closest I could find was this question, constructing-a-co-occurrence-matrix-in-python-pandas, but the df there is structured differently.