0

I am trying to obtain pairwise counts of two column variables using pandas. I have a dataframe of two columns in the following format:

col1 col2
a   e
b   g
c   h
d   f
a   g
b   h
c   f
d   e
a   f
b   g
c   g
d   h
a   e
b   e
c   g
d   h
b   h

What I would like to get as output would be the following matrix of counts, for e.g.:

  e f g h 
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2

I am getting totally confused with pandas iterating over columns, rows, indexes and such. Appreciate some guidance here.

2 Answers2

3

Pandas often has simple functions built in - in this case, you want crosstab:

pd.crosstab(dat['col1'], dat['col2'])

full code:

import pandas as pd
from io import StringIO

x = '''col1 col2
a   e
b   g
c   h
d   f
a   g
b   h
c   f
d   e
a   f
b   g
c   g
d   h
a   e
b   e
c   g
d   h
b   h'''

dat = pd.read_csv(StringIO(x), sep = '\s+')

pd.crosstab(dat['col1'], dat['col2'])
jeremycg
  • 24,657
  • 5
  • 63
  • 74
1

You're looking for a crosstab:

count_matrix = pd.crosstab(index=df["col1"], columns=df["col2"])

print(count_matrix)
col2  e  f  g  h
col1
a     2  1  1  0
b     1  0  2  2
c     0  1  2  1
d     1  1  0  2

If you don't like the column/index names in (e.g. still seeing "col1" and "col2"), then you can remove them with rename_axis:

count_matrix = count_matrix.rename_axis(index=None, columns=None)

print(count_matrix)
   e  f  g  h
a  2  1  1  0
b  1  0  2  2
c  0  1  2  1
d  1  1  0  2

If you want that all together in one snippet:

count_matrix = (pd.crosstab(index=df["col1"], columns=df["col2"])
                .rename_axis(index=None, columns=None))

Cameron Riddell
  • 10,942
  • 9
  • 19