I have the following df:

          CUI1      CUI2  tot
0     C0000699  C3894683    2
1     C0000699  C0101725    1
2     C0000699  C1882413    3
...        ...       ...  ...
9995  C0000715  C0026382   56
9996  C0000715  C0010334  101
...

which I need to transform into a co-occurrence matrix:

          C0000699 C3894683 C0101725 C1882413 ... C0026382
C0000699  0        2        1        3            m
...
C3894683  2        0        n        p        ... q
...
etc..   

The df is extremely large (~11 million rows). I have tried looping through it to set columns (using unstack, get_level, etc.), but that takes an inordinate amount of time and memory. Also note that pairs with a tot of 0 are not present in the initial stacked df.

Are there any vectorized solutions that would speed this up and lower the memory footprint? The closest I could find was constructing-a-co-occurrence-matrix-in-python-pandas, but the df there is structured differently.

martineau
horcle_buzz

1 Answer

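You can try this:

df.pivot_table(index='CUI1', columns='CUI2', values='tot', aggfunc='sum', fill_value=0)

Here fill_value=0 puts a 0 wherever a pair is missing from df, and aggfunc='sum' adds up duplicate pairs instead of averaging them (the default).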
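
With ~11 million rows, a dense pivot may be too large to hold in memory, and the pivot_table call above does not by itself give a square, symmetric matrix. A sparse matrix is one possible alternative. The following is only a minimal sketch, not a tested solution for the full data set: it assumes scipy is installed, that each unordered pair appears once in df (so mirroring it does not double-count), and that there are no self-pairs. The helper name cooccurrence_sparse is made up for illustration.

```python
import pandas as pd
from scipy import sparse


def cooccurrence_sparse(df, a="CUI1", b="CUI2", value="tot"):
    """Hypothetical helper: build a symmetric sparse co-occurrence matrix."""
    n_rows = len(df)
    # Factorize both columns together so rows and columns share one label set.
    codes, labels = pd.factorize(pd.concat([df[a], df[b]], ignore_index=True))
    rows, cols = codes[:n_rows], codes[n_rows:]
    n = len(labels)
    # Duplicate (row, col) entries are summed when COO is converted to CSR.
    m = sparse.coo_matrix((df[value].to_numpy(), (rows, cols)), shape=(n, n)).tocsr()
    # Assumption: each unordered pair is listed once, so mirror it across the diagonal.
    m = m + m.T
    return m, labels


# Small demo using the first three rows shown in the question.
demo = pd.DataFrame({
    "CUI1": ["C0000699", "C0000699", "C0000699"],
    "CUI2": ["C3894683", "C0101725", "C1882413"],
    "tot": [2, 1, 3],
})
m, labels = cooccurrence_sparse(demo)
print(list(labels))
print(m.toarray())
```

Pairs that are absent from df are simply not stored, which is where the memory saving comes from. If a labelled view is needed, pd.DataFrame.sparse.from_spmatrix(m, index=labels, columns=labels) should wrap the result without densifying it.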
AvivSar