4

I have a pandas data frame counted and grouped by specific columns.

import pandas as pd
df = pd.DataFrame({'x':list('aaabbbbbccccc'),'y':list('2225555577777'), 'z':list('1312223224432')})
#
df.groupby(['x','y','z'])['z'].count()
# or
df.groupby(['x','y','z'])['z'].agg(['count'])
# or
df.groupby(['x','y','z'])['z'].count().reset_index(name='counts')

Results is;

   x  y  z  counts
0  a  2  1       2
1  a  2  3       1
2  b  5  2       4
3  b  5  3       1
4  c  7  2       2
5  c  7  3       1
6  c  7  4       2

How can I convert the result to following form?

   x  y 1 2 3 4
0  a  2 2 0 1 0
1  b  5 0 4 1 0
2  c  7 0 2 1 2
Sezen
  • 447
  • 1
  • 5
  • 17
  • 1
    This also lays out strategies for performing this task. Though not identical, it should prove helpful. https://stackoverflow.com/a/47152692/2336654 – piRSquared Mar 22 '18 at 17:27

3 Answers3

4

You will need to use unstack + reset_index:

(df.groupby(['x','y','z'])['z']
   .count()
   .unstack(-1, fill_value=0)
   .reset_index()
   .rename_axis(None, axis=1)
)

   x  y  1  2  3  4
0  a  2  2  0  1  0
1  b  5  0  4  1  0
2  c  7  0  2  1  2

Note, you can replace df.groupby(['x','y','z'])['z'].count() with df.groupby(['x','y','z']).size() for compactness, but beware that size also counts NaNs.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • 1
    You can replace `count` with `size` and the `groupby` reference to `['z']` becomes unnecessary. – piRSquared Mar 22 '18 at 17:06
  • 1
    @piRSquared Oops, sorry, just saw this thread. Yes, thank you. Will modify. – cs95 Mar 22 '18 at 17:22
  • Thanks for the answers and comments. I accept this as answer because this is approximately 8 times faster than Wen's crosstab and 4 times faster than piRSquared's method. – Sezen Mar 23 '18 at 00:15
  • @Sezen Thank you for the fair arbitration. – cs95 Mar 23 '18 at 00:56
4

Something like crosstab

pd.crosstab([df.x,df.y],df.z).reset_index()
Out[81]: 
z  x  y  1  2  3  4
0  a  2  2  0  1  0
1  b  5  0  4  1  0
2  c  7  0  2  1  2
BENY
  • 317,841
  • 20
  • 164
  • 234
3

PROJECT/KILL <-- (read: project overkill)


This uses Pandas factorize to get integer representations of unique values. The function pd.factorize returns the passed array in its new integer form as well as an array of what the unique values were.

If we do this for two arrays whose positions correspond to each other, we can perform a cross tabulation using Numpy's bincount.

Bin counting simply increments a "bin" each time a value is encountered that represents the "bin". np.bincount assumes the bins are array indices from 0:. So this starts to make sense if I want to bin count a single array of integers. But how do I handle a two dimensional array? I have to shift my integer values such that they start counting in a "new row". I also have to figure out what "new row" that is. The integer values from the unique row values represent the row. The number of unique column values represent the "shift"

tups = list(zip(df.x, df.y))
i, r = pd.factorize(tups)
j, c = pd.factorize(df.z)
n, m = len(r), len(c)
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)

pd.DataFrame(
    np.column_stack([r.tolist(), b]),
    columns=['x', 'y'] + c.tolist()
)

   x  y  1  3  2  4
0  a  2  2  1  0  0
1  b  5  0  1  4  0
2  c  7  0  1  2  2

With sorting the z's

Notice that I used Pandas factorize. I could have used Numpy's unique to do some of this. However, there are two reasons why I didn't. One, np.unique sorts values by default when returning the inverse array (that's what gives me the factorization). Sorting has a time complexity of O(n * log(n)) and can be a hit on performance for larger arrays. The second reason is that Numpy would require some additional annoying handling to perform the task on tuples.

However, in this case, I wanted to see the z columns presented in the same order OP had them. That required that I sort while factorizing. I still didn't want to use Numpy so I just used the sort flag in pd.factorize

tups = list(zip(df.x, df.y))
i, r = pd.factorize(tups)
j, c = pd.factorize(df.z, sort=True)
n, m = len(r), len(c)
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)

pd.DataFrame(
    np.column_stack([r.tolist(), b]),
    columns=['x', 'y'] + c.tolist()
)

   x  y  1  2  3  4
0  a  2  2  0  1  0
1  b  5  0  4  1  0
2  c  7  0  2  1  2
piRSquared
  • 285,575
  • 57
  • 475
  • 624