Python Pandas how to find top string which co-occurs?

Question

I have generated a co-occurrence matrix by using the Python pandas library, with the following code:

# dfdo is an ordered dictionary with a key called KEY453    

df = pd.DataFrame(dfdo).set_index('KEY453')
df_asint = df.astype(int)
com = df_asint.T.dot(df_asint)

It follows the same procedure as this question.
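
For readers without the linked question at hand, the construction above can be sketched on toy data. The contents of `dfdo` below are hypothetical (the original data is not shown), and zeroing the diagonal is an assumption made so the result matches the example matrix in this question; the three lines of code above do not do that by themselves.

```python
from collections import OrderedDict

import numpy as np
import pandas as pd

# Hypothetical stand-in for dfdo: one row per record, 1 if the string occurs.
dfdo = OrderedDict([
    ("KEY453", ["r1", "r2", "r3", "r4", "r5", "r6"]),
    ("Cat",    [1, 1, 1, 1, 1, 0]),
    ("Dog",    [1, 1, 0, 0, 0, 1]),
    ("Zebra",  [0, 0, 1, 1, 1, 1]),
])

df = pd.DataFrame(dfdo).set_index("KEY453")
df_asint = df.astype(int)
com = df_asint.T.dot(df_asint)

# The diagonal counts how often each string occurs at all. Zero it on a
# copy of the values so a string is never ranked against itself.
values = com.to_numpy(copy=True)
np.fill_diagonal(values, 0)
com = pd.DataFrame(values, index=com.index, columns=com.columns)

print(com)
```

Working on a copy of the values avoids writing into an array that pandas may hand out as read-only.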

My question is, how can I find the top 2 strings which co-occur with a given string in the matrix? For example, the top 2 strings that co-occur with Dog in the example below are Cat and Zebra.

       Cat  Dog Zebra
Cat     0    2    3
Dog     2    0    1
Zebra   3    1    0

Do you think `df.idxmax()`? – jezrael, Nov 15 '16 at 13:43
jezrael: edited for more clarity. – Paradox, Nov 15 '16 at 13:57

jezrael · Accepted Answer · 2016-11-15T14:05:17.793

I think you can use nlargest (here df is the co-occurrence matrix, called com in the question):

print (df.loc['Dog'].nlargest(2))
Cat      2
Zebra    1
Name: Dog, dtype: int64

print (df.loc['Dog'].nlargest(2).index)
Index(['Cat', 'Zebra'], dtype='object')
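
If the diagonal of the matrix is not zero (it holds each string's own count when the matrix comes straight from `T.dot`), the string itself can end up in the top 2. A minimal sketch that drops the queried label first; `top_cooccurring` is a hypothetical helper name, and the matrix is rebuilt here from the example in the question:

```python
import pandas as pd

# Example matrix from the question.
com = pd.DataFrame(
    [[0, 2, 3], [2, 0, 1], [3, 1, 0]],
    index=["Cat", "Dog", "Zebra"],
    columns=["Cat", "Dog", "Zebra"],
)


def top_cooccurring(matrix, label, k=2):
    """Return the k labels with the highest counts in row `label`."""
    row = matrix.loc[label].drop(label)  # exclude the string itself
    return row.nlargest(k).index.tolist()


print(top_cooccurring(com, "Dog"))  # ['Cat', 'Zebra']
```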

If you need the top 2 for every row of the DataFrame, use numpy.argsort:

print (np.argsort(-df.values, axis=1)[:, :2])
[[2 1]
 [0 2]
 [0 1]]

# index the underlying NumPy array: recent pandas versions no longer allow 2-D indexing of an Index
print (df.columns.values[np.argsort(-df.values, axis=1)[:, :2]])
[['Zebra' 'Dog']
 ['Cat' 'Zebra']
 ['Cat' 'Dog']]

print (pd.DataFrame(df.columns.values[np.argsort(-df.values, axis=1)[:, :2]],
                               index=df.index,
                               columns=['first','second']))

       first second
Cat    Zebra    Dog
Dog      Cat  Zebra
Zebra    Cat    Dog
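
The argsort approach extends to any k. The following is only a sketch: `top_k_table` and the `top1`, `top2`, ... column names are hypothetical, the matrix is rebuilt from the question's example, and a zero diagonal is assumed so that a string does not rank against itself.

```python
import numpy as np
import pandas as pd

# Example matrix from the question.
com = pd.DataFrame(
    [[0, 2, 3], [2, 0, 1], [3, 1, 0]],
    index=["Cat", "Dog", "Zebra"],
    columns=["Cat", "Dog", "Zebra"],
)


def top_k_table(matrix, k):
    """For every row, list the k column labels with the largest values."""
    values = matrix.to_numpy()
    labels = matrix.columns.to_numpy()
    # Negate so that an ascending argsort yields descending counts; a stable
    # sort keeps the original column order among tied counts.
    order = np.argsort(-values, axis=1, kind="stable")[:, :k]
    names = ["top{}".format(i + 1) for i in range(k)]
    return pd.DataFrame(labels[order], index=matrix.index, columns=names)


top2 = top_k_table(com, 2)
print(top2)
```

Indexing the plain NumPy array of labels, not the `Index` object, keeps this working on pandas versions that reject multi-dimensional indexing of an `Index`.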

Or use apply with nlargest on each row:

print (df.apply(lambda x: pd.Series(x.nlargest(2).index, index=['first','second']), axis=1))
       first second
Cat    Zebra    Dog
Dog      Cat  Zebra
Zebra    Cat    Dog

piRSquared · Answer 2 · 2016-11-15T14:51:02.947

option 1
stack then nlargest

df.stack().nlargest(1)

Cat  Zebra    3
dtype: int64

option 2
stack then idxmax

df.stack().idxmax()

('Cat', 'Zebra')
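
Because the matrix is symmetric, stacking it lists every pair twice, once as (Cat, Zebra) and once as (Zebra, Cat). A possible sketch for ranking each pair only once, assuming the example matrix from the question, is to mask everything except the upper triangle before stacking:

```python
import numpy as np
import pandas as pd

# Example matrix from the question.
com = pd.DataFrame(
    [[0, 2, 3], [2, 0, 1], [3, 1, 0]],
    index=["Cat", "Dog", "Zebra"],
    columns=["Cat", "Dog", "Zebra"],
)

# Keep only the cells strictly above the diagonal so each pair appears once;
# the masked cells become NaN and are dropped after stacking.
upper = np.triu(np.ones(com.shape, dtype=bool), k=1)
pairs = com.where(upper).stack().dropna().astype(int)

top_pairs = pairs.nlargest(2)
print(top_pairs)
```

The explicit `dropna()` is there so the result does not depend on whether `stack` itself discards missing values.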

edited Nov 15 '16 at 14:51

answered Nov 15 '16 at 14:25
