Cross tabulate counts between pairs of keywords per group with pandas

Question

I have a table of keywords associated with articles. It looks like this:

article_id  keyword
1           A
1           B
1           C
2           A
2           B
2           D
3           E
3           F
3           D

I need to get a kind of pivot table:

    A   B   C   D   E   F
A   -   2   1   1   0   0
B   -   -   1   1   0   0
C   -   -   -   0   0   0
D   -   -   -   -   1   1
E   -   -   -   -   -   1
F   -   -   -   -   -   -

This means that the pair (A, B) occurs in two articles (#1 and #2), the pair (A, C) occurs in just one article (#1), and so on.
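For reference, the sample table can be rebuilt as a DataFrame and the expected pair counts checked by brute force. This is only a sketch for verifying results; the names `df` and `pair_counts` are assumptions made here for illustration.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "article_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "keyword": list("ABCABDEFD"),
})

# Count each unordered keyword pair once per article
pair_counts = Counter()
for _, keywords in df.groupby("article_id")["keyword"]:
    pair_counts.update(combinations(sorted(set(keywords)), 2))

print(pair_counts[("A", "B")])  # 2
print(pair_counts[("A", "C")])  # 1
```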

What is the most Pythonic way to do that?

I tried pandas pivot tables, but with no success so far. I just can't work out how to connect the keywords and article ids.

The question "Create adjacency matrix for two columns in pandas dataframe" doesn't solve the problem.

score 8 · Accepted Answer · answered Dec 16 '18 at 13:20

Use crosstab and dot. You can then use np.triu to retain only the upper half of the matrix (everything else is set to 0).

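import numpy as np
import pandas as pd

u = pd.crosstab(df.article_id, df.keyword)  # articles x keywords indicator table
v = u.T.dot(u)                              # keyword-by-keyword co-occurrence counts
pd.DataFrame(np.triu(v, k=1), index=v.index.values, columns=v.columns.values)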

   A  B  C  D  E  F
A  0  2  1  1  0  0
B  0  0  1  1  0  0
C  0  0  0  0  0  0
D  0  0  0  0  1  1
E  0  0  0  0  0  1
F  0  0  0  0  0  0
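Why this works: `u` has one row per article and one column per keyword, so entry (i, j) of `u.T.dot(u)` sums, over all articles, the product of the two keyword indicators, which is the number of articles containing both keywords. Below is a self-contained sketch on the question's sample data. The `clip(upper=1)` step is an added assumption, not part of the answer above; it guards against a keyword being listed twice for the same article.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "article_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "keyword": list("ABCABDEFD"),
})

# Article-by-keyword indicator matrix (clip is an assumption: it caps
# repeated (article, keyword) rows at 1 so each article counts once)
u = pd.crosstab(df["article_id"], df["keyword"]).clip(upper=1)

# Keyword-by-keyword co-occurrence counts; the diagonal holds the
# number of articles per keyword
v = u.T.dot(u)

# Keep only the strict upper triangle, zeroing everything else
result = pd.DataFrame(np.triu(v, k=1), index=v.index, columns=v.columns)
print(result)
```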

Alternatively, for the last step, you can set the invalid entries (the diagonal and everything below it) to -1, which keeps the matrix numeric, unlike "-".

v.values[np.tril_indices_from(v)] = -1
print(v)

keyword  A  B  C  D  E  F
keyword                  
A       -1  2  1  1  0  0
B       -1 -1  1  1  0  0
C       -1 -1 -1  0  0  0
D       -1 -1 -1 -1  1  1
E       -1 -1 -1 -1 -1  1
F       -1 -1 -1 -1 -1 -1
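The in-place write through `v.values` relies on pandas handing back a writable view of the underlying array, which may not hold in every setup (for example with copy-on-write enabled). A sketch of a variant that builds a boolean mask and uses `DataFrame.where` instead; the names `keep` and `masked` are assumptions made here for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "article_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "keyword": list("ABCABDEFD"),
})

u = pd.crosstab(df["article_id"], df["keyword"])
v = u.T.dot(u)

# True strictly above the diagonal, False on and below it
keep = np.triu(np.ones(v.shape, dtype=bool), k=1)

# Keep the counts where `keep` is True, write -1 everywhere else;
# this returns a new frame and leaves `v` untouched
masked = v.where(keep, -1)
print(masked)
```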

cs95

@Dark A good question like this had been sitting without an answer for 50 minutes... I have never seen that before :) – cs95 Dec 16 '18 at 13:33
Yeah, I thought it was a good chance to answer; I got the solution, came back, and there was a great answer already. – Bharath M Shetty Dec 16 '18 at 13:33
Also, thanks for editing the question header! Sounds much better now. – Ildar Akhmetov Dec 17 '18 at 06:46
@IldarAkhmetov my pleasure. – cs95 Dec 17 '18 at 07:13

ayorgo · Answer 2 · 2018-12-16T15:21:22.993

You can also do it using either merge and crosstab

df_merge = df.merge(df, on='article_id')
pd.crosstab(df_merge['keyword_x'], df_merge['keyword_y'])

or merge and pivot_table

df_merge = df.merge(df, on='article_id')
df_merge.pivot_table(values='article_id', index='keyword_x', columns='keyword_y', aggfunc='count', fill_value=0)

both resulting in

keyword_y  A  B  C  D  E  F
keyword_x                  
A          2  2  1  1  0  0
B          2  2  1  1  0  0
C          1  1  1  0  0  0
D          1  1  0  2  1  1
E          0  0  0  1  1  1
F          0  0  0  1  1  1
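Note that this result is symmetric and its diagonal holds the number of articles per keyword. To get only the upper triangle asked for in the question, one option (a sketch, assuming the keywords can be ordered with `<`) is to keep only the merged rows where `keyword_x` sorts before `keyword_y`, then reindex so that every keyword still appears on both axes. The names `pairs`, `keywords`, and `upper` are assumptions made here for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "article_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "keyword": list("ABCABDEFD"),
})

# Self-join on article_id gives every ordered keyword pair per article
pairs = df.merge(df, on="article_id")

# Keep each unordered pair once and drop the (k, k) self-pairs
pairs = pairs[pairs["keyword_x"] < pairs["keyword_y"]]

# Filtering removes some keywords from one axis, so restore the full
# keyword list on rows and columns, filling the gaps with 0
keywords = sorted(df["keyword"].unique())
upper = pd.crosstab(pairs["keyword_x"], pairs["keyword_y"]).reindex(
    index=keywords, columns=keywords, fill_value=0
)
print(upper)
```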

Bharath M Shetty · Answer 3 · 2018-12-16T13:37:17.320

You can use itertools.product over the groups and a for loop to increment the counts, i.e.

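from itertools import product

import pandas as pd

keywords = df['keyword'].unique()
# object dtype so that cells can hold both counts and the '-' placeholder
df2 = pd.DataFrame(0, index=keywords, columns=keywords, dtype=object)

# for every article, visit each ordered pair of its keywords
for _, group in df.groupby('article_id')['keyword']:
    for k, l in product(group, group):
        if k == l:
            df2.loc[k, l] = '-'  # mark the diagonal
        else:
            df2.loc[k, l] += 1

# keep the upper triangle; everything on or below the diagonal becomes '-'
df2 = df2.where((df2 == '-').cumsum().T.astype(bool), '-')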

   A  B  C  D  E  F
A  -  2  1  1  0  0
B  -  -  1  1  0  0
C  -  -  -  0  0  0
D  -  -  -  -  1  1
E  -  -  -  -  -  1
F  -  -  -  -  -  -
