Count unique pairs and store counts in a matrix

Question

My question is similar to stackoverflow.com/q/7549410

I have paired data which look like this:

I want to count the unique pairs and store those frequency counts in a matrix like this:

     10   20   30
1 |   0    2    0
3 |   1    0    0
4 |   0    0    1

Alternatively, if it's known that ID takes values in {1, 2, 3, 4} while ATTR in {0, 10, 20, 30} then I want a matrix as such:

     0   10   20   30
1 |  0    0    2    0
2 |  0    0    0    0
3 |  0    1    0    0
4 |  0    0    0    1

Question: What's the fastest way to do both of them in Python or NumPy?

I have tried using Pandas but I get an empty DataFrame:

import numpy as np
import pandas as pd
x = pd.DataFrame([[3, 10], [1, 20], [1, 20], [4, 30]])
x.pivot_table(index = 0, columns = 1, fill_value = 0, aggfunc = 'sum')

`x.pivot_table(index=0, columns=1, values=1, aggfunc='size', fill_value=0)` — cs95, Jan 23 '18 at 02:55
@cᴏʟᴅsᴘᴇᴇᴅ Pls see my updated question. It's not a duplicated question since there are 2 kinds of matrices I'm seeking. — visitor, Jan 23 '18 at 02:55

cs95 · Accepted Answer · 2018-01-23T03:07:04.293

It looks like you want to perform a cross tabulation, followed by a reindexing operation. For the cross tabulation, there are many ways to skin a cat.

First, using pivot_table -

v = x.pivot_table(
      index=0, 
      columns=1, 
      values=1, 
      aggfunc='size', 
      fill_value=0
)

Or, pd.crosstab -

v = pd.crosstab(x[0], x[1])

Or, set_index + get_dummies + sum(level=0)

v = pd.get_dummies(x.set_index(0)[1]).sum(level=0)

Or, get_dummies + dot -

v = pd.get_dummies(x[0]).T.dot(pd.get_dummies(x[1]))

v

   10  20  30
1   0   2   0
3   1   0   0
4   0   0   1

Next, call reindex on v -

v.reindex(index=range(1, 5), columns=range(0, 40, 10), fill_value=0)

1  0   10  20  30
0                
1   0   0   2   0
2   0   0   0   0
3   0   1   0   0
4   0   0   0   1

score 1 · Answer 2 · answered Jan 23 '18 at 03:26

1

You can using category

df.ID=df.ID.astype('category',categories=[1,2,3,4])
df.ATTR=df.ATTR.astype('category',categories=[0,10,20,30])

pd.crosstab(df.ID,df.ATTR)
Out[1143]: 
ATTR  0   10  20  30
ID                  
1      0   0   2   0
2      0   0   0   0
3      0   1   0   0
4      0   0   0   1

answered Jan 23 '18 at 03:26

BENY

317,841
20
164
234

Categories is just as clever, and would work just as well for strings. Nicely done. – cs95 Jan 23 '18 at 03:28

Count unique pairs and store counts in a matrix

2 Answers2

Linked