1

Suppose I have the following data:

import pandas as pd

df = pd.DataFrame([
    ['01', 'A'],
    ['01', 'B'],
    ['01', 'C'],
    ['02', 'A'],
    ['02', 'B'],
    ['03', 'B'],
    ['03', 'C']
], columns=['id', 'category'])

How do I create a frequency matrix like this?

    A   B   C           
A   2   2   1
B   2   3   2
C   1   2   2

One way to do it is through self join:

result = df.merge(df, on='id')
pd.pivot_table(
    result,
    index='category_x',
    columns='category_y',
    values='id',
    aggfunc='count'
)

But this will make the data size very large, is there any efficient way to do it without using self join?

Edit My original post was closed for duplication of pivot_table. But pivot_table only accept different columns and index. In my case, I have only one category column. So

# Does not work
pivot_table(df, column='category', index='category', ...)

does not work.

wong.lok.yin
  • 849
  • 1
  • 5
  • 10
  • The duplicate does not only mention pivot_table. The accepted answer alone mentions at least 7 different ways of doing this. – Mark Rotteveel Apr 25 '23 at 10:42
  • @Mark Rotteveel Those 7 different ways are just alternative way to make a r * c pivot table. They only work if you have different `row` and `column`. But I have the same row and column. My expected output is more like a correlation matrix, but replace with frequency. – wong.lok.yin Apr 26 '23 at 01:52

1 Answers1

0

Here is one way to do it with combinations_with_replacement and Counter from Python standard library:

from collections import Counter
from itertools import combinations_with_replacement


pair_counts = Counter(
    df.groupby("id")
    .agg(list)
    .apply(lambda x: list(combinations_with_replacement(x["category"], 2)), axis=1)
    .sum()
)

new_df = pd.DataFrame()
for pair, count in pair_counts.items():
    new_df.at[pair[0], pair[1]] = count
    new_df.at[pair[1], pair[0]] = count

new_df = new_df.astype(int)

Then:

print(new_df)
# Output

   A  B  C
A  2  2  1
B  2  3  2
C  1  2  2
Laurent
  • 12,287
  • 7
  • 21
  • 37
  • Thx for your reply. But `pair_counts` still list out all the possibility. Is there more pandasic way to do it? – wong.lok.yin May 02 '23 at 02:21
  • Hi, not that I know of, sorry. But doesn't that avoid making the data size very large, though (it is only counting), which seemed to be your main concern? Cheers. – Laurent May 02 '23 at 13:04