0

Supposed I have a set of data with two labels, put in a pandas Dataframe:

    label1 label2
0        0      a
1        1      a
2        1      a
3        1      a
4        1      a
5        2      b
6        0      b
7        1      b
8        2      b
9        0      b
10       2      c
11       1      c
12       2      c
13       0      c
14       2      c

Using the following code, the number of elements for each combination of labels can be obtained:

grouped = df.groupby(['label1', 'label2'], sort = False)
grouped.size()

The result is something like this:

label1  label2
0       a         1
1       a         4
2       b         2
0       b         2
1       b         1
2       c         3
1       c         1
0       c         1
dtype: int64

However, I would also like to compare the distribution of data count for label 2 in each label 1 group. I imagine the most convenient way to further manipulate the data for this purpose would be having a Dataframe (or some sort of table) with label1/2 as rows/columns and content as data count, like this:

    a   b   c
0   1   2   1
1   4   1   1
2   0   2   3

After a while of search, to my surprise, there seems no easy way to do this kind of dataframe reshaping in pandas.

Using a loop is possible. But I assumed it would be super slow, since in the real data, there are hundreds of thousands of different labels.

Moreover, there seems no way to get a group from only label1, after grouping with both label1 and label2, so the loop will have to be on the combination of labels, which might make things even slower and more complicated.

Anyone knows a smart way to do this?

cclkevin
  • 3
  • 1
  • Thank you very much guys! Once I learnt that this type of operation is called pivot, it became obvious that there are actually many simple ways to this. So I guess my real question was I didn't know the keyword... – cclkevin Jun 17 '18 at 10:24
  • Both answers worked well, crosstab is slightly simpler – cclkevin Jun 17 '18 at 10:33

2 Answers2

0

Are you looking for pd.pivot_table?

df.pivot_table(index='label1', columns='label2', aggfunc='size').fillna(0)
stefanw
  • 10,456
  • 3
  • 36
  • 34
0

Probably crosstab:

pd.crosstab(df.label1, df.label2)
jxc
  • 13,553
  • 4
  • 16
  • 34