
I have generated a co-occurrence matrix by using the Python pandas library, with the following code:

import pandas as pd

# dfdo is an ordered dictionary with a key called KEY453
df = pd.DataFrame(dfdo).set_index('KEY453')
df_asint = df.astype(int)
# transposed indicator matrix times itself gives the co-occurrence counts
com = df_asint.T.dot(df_asint)

It follows the same procedure as this question.

My question is: how can I find the top 2 strings that co-occur with a given string in the matrix? For example, the top 2 strings that co-occur with Dog in the example below are Cat and Zebra.

       Cat  Dog Zebra
Cat     0    2    3
Dog     2    0    1
Zebra   3    1    0

2 Answers


I think you can use nlargest on the row for the given string (df here is the co-occurrence matrix, called com in the question):

print (df.loc['Dog'].nlargest(2))
Cat      2
Zebra    1
Name: Dog, dtype: int64

print (df.loc['Dog'].nlargest(2).index)
Index(['Cat', 'Zebra'], dtype='object')
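
If the matrix comes straight from df_asint.T.dot(df_asint), the diagonal usually holds each string's own count rather than 0, so the string itself could come out as its own top match. A minimal sketch of one way around that, assuming a made-up matrix with a non-zero diagonal and a hypothetical helper named top_cooccurring (neither is from the question):

```python
import pandas as pd

# Made-up co-occurrence matrix; the diagonal holds self counts (assumption).
com = pd.DataFrame(
    [[5, 2, 3],
     [2, 4, 1],
     [3, 1, 6]],
    index=["Cat", "Dog", "Zebra"],
    columns=["Cat", "Dog", "Zebra"],
)

def top_cooccurring(matrix, label, n=2):
    """Return the n labels that co-occur most often with `label` (hypothetical helper)."""
    # Drop the label's own entry so the diagonal value cannot win.
    row = matrix.loc[label].drop(label)
    return row.nlargest(n).index.tolist()

print(top_cooccurring(com, "Dog"))  # ['Cat', 'Zebra']
```

With a zeroed diagonal, as in the question's example, the extra drop is harmless.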

If you need the top 2 for every row of the DataFrame, use numpy.argsort:

print (np.argsort(-df.values, axis=1)[:, :2])
[[2 1]
 [0 2]
 [0 1]]

# index a plain array of column names; recent pandas versions reject 2-D indexing on an Index
print (df.columns.to_numpy()[np.argsort(-df.values, axis=1)[:, :2]])
[['Zebra' 'Dog']
 ['Cat' 'Zebra']
 ['Cat' 'Dog']]

print (pd.DataFrame(df.columns.to_numpy()[np.argsort(-df.values, axis=1)[:, :2]],
                    index=df.index,
                    columns=['first','second']))

       first second
Cat    Zebra    Dog
Dog      Cat  Zebra
Zebra    Cat    Dog

or apply:

print (df.apply(lambda x: pd.Series(x.nlargest(2).index, index=['first','second']), axis=1))
       first second
Cat    Zebra    Dog
Dog      Cat  Zebra
Zebra    Cat    Dog
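
Both variants above hard-code two result columns. A sketch that generalizes the argsort approach to any k, assuming the example matrix from the question; the function name top_k_table and the column names top1, top2, ... are made up for illustration:

```python
import numpy as np
import pandas as pd

# Example matrix from the question.
df = pd.DataFrame(
    [[0, 2, 3],
     [2, 0, 1],
     [3, 1, 0]],
    index=["Cat", "Dog", "Zebra"],
    columns=["Cat", "Dog", "Zebra"],
)

def top_k_table(matrix, k=2):
    """Return, per row, the k column labels with the largest values (hypothetical helper)."""
    # Stable sort on the negated values: descending order, ties keep column order.
    order = np.argsort(-matrix.to_numpy(), axis=1, kind="stable")[:, :k]
    # Map positions back to labels through a plain NumPy array of column names.
    labels = matrix.columns.to_numpy()[order]
    return pd.DataFrame(
        labels,
        index=matrix.index,
        columns=[f"top{i + 1}" for i in range(k)],
    )

print(top_k_table(df, k=2))
```

The stable sort is a design choice: with tied counts, the label that appears first in the columns is reported first, which keeps the result reproducible.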

option 1
stack then nlargest (this finds the strongest pair in the whole matrix)

df.stack().nlargest(1)

Cat  Zebra    3
dtype: int64

option 2
stack then idxmax

df.stack().idxmax()

('Cat', 'Zebra')
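
Because the matrix is symmetric, asking the stacked Series for more than one pair returns each pair twice, once as (Cat, Zebra) and once as (Zebra, Cat). A sketch of one way to avoid that, assuming the example matrix from the question: mask everything except the strict upper triangle before stacking. The names mask, pairs and top_pairs are illustrative.

```python
import numpy as np
import pandas as pd

# Example matrix from the question.
df = pd.DataFrame(
    [[0, 2, 3],
     [2, 0, 1],
     [3, 1, 0]],
    index=["Cat", "Dog", "Zebra"],
    columns=["Cat", "Dog", "Zebra"],
)

# True only above the diagonal, so each unordered pair is kept once
# and the diagonal itself is ignored.
mask = np.triu(np.ones(df.shape, dtype=bool), k=1)

# Masked cells become NaN; dropna() removes them explicitly so the result
# does not depend on how stack() treats missing values.
pairs = df.where(mask).stack().dropna().astype(int)

top_pairs = pairs.nlargest(2)
print(top_pairs)
```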