
Suppose I have a pandas DataFrame with two columns:

df: Col1  Col2
      1     1
      1     2
      1     2
      1     2
      3     4
      3     4

I want to keep only the unique pairs of values (Col1, Col2) from these two columns and get their frequency:

df2: Col1  Col2  Freq
      1     1     1
      1     2     3
      3     4     2

I thought of using `df['Col1', 'Col2'].value_counts()`, but it only works for a single column. Is there a function that can handle multiple columns?

curious_one
  • `df.groupby(['Col1', 'Col2']).size()`? – Zero Jul 04 '17 at 13:01
  • Ambiguous title: this does not find the unique values in **either** Col1 or Col2, but the **unique combinations of values in both** Col1 and Col2 that occur in the data. This might not be what you want, esp. for columns with higher cardinality than boolean (only two values). – smci Apr 08 '20 at 21:33

2 Answers


You need groupby + size + Series.reset_index:

df = df.groupby(['Col1', 'Col2']).size().reset_index(name='Freq')
print(df)
   Col1  Col2  Freq
0     1     1     1
1     1     2     3
2     3     4     2
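
For reference, here is a self-contained sketch of the same approach, rebuilt from the sample data in the question (only the construction of the DataFrame is added):

import pandas as pd

# sample data taken from the question
df = pd.DataFrame({'Col1': [1, 1, 1, 1, 3, 3],
                   'Col2': [1, 2, 2, 2, 4, 4]})

# group by the pair of columns, count the rows in each group with size(),
# then turn the resulting Series back into a DataFrame with a 'Freq' column
df2 = df.groupby(['Col1', 'Col2']).size().reset_index(name='Freq')
print(df2)
#    Col1  Col2  Freq
# 0     1     1     1
# 1     1     2     3
# 2     3     4     2

As a side note (not part of the original answer): on pandas 1.1+, DataFrame.value_counts also accepts several columns, so `df[['Col1', 'Col2']].value_counts()` gives the same counts as a MultiIndex Series.
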
jezrael
  • Thanks for the amazing answer. I'm trying to understand it by running it bit by bit, and I have a couple of questions: 1. What should I use if I only need `Col1` and `Col2`, i.e. only the unique pairs of values for the first two columns - would your answer still be the optimal method? 2. Why does `df.groupby(['Col1', 'Col2']).size()` return a Series for me? Thanks again. – Bowen Liu Sep 19 '18 at 19:52
  • @BowenLiu - 1. I think it is really fast; maybe some numpy solution would be faster. 2. In my opinion it returns a Series by design - no extra column is necessary, unlike aggregations such as `mean` or `sum` (`df.groupby(['Col1', 'Col2'])['Col3'].sum()`), because the output is counted from the columns defined in `groupby` - `Col1` and `Col3` - it groups and counts on the same columns. For `sum` it groups by `Col1` and `Col2` and aggregates `Col3` - the column(s) selected after `groupby` - or, if omitted as in `df.groupby(['Col1', 'Col2']).sum()`, it aggregates the sum over all remaining columns. – jezrael Sep 20 '18 at 05:08
  • Can't believe I didn't see your reply. Reading it after using pandas for several months makes a lot more sense to me now. The only thing I still don't get is the `reset_index(name='Freq')` part. In the pandas documentation, `name` is not a kwarg for `reset_index`. How did you manage to name the column that was not part of the index in the groupby result this way? Thanks. – Bowen Liu Mar 20 '19 at 12:52
  • @BowenLiu - oops, that was the wrong link; it should be [`Series.reset_index`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reset_index.html) - the `name` parameter only works with `Series`. – jezrael Mar 20 '19 at 12:55
  • Thanks a lot! I had forgotten that the groupby result here is a Series. – Bowen Liu Mar 20 '19 at 13:29
  • @BowenLiu - it depends, but after `.size` it is a `Series`. – jezrael Mar 20 '19 at 13:29
  • Revisiting this to get familiar with pandas again, as I haven't used it for a couple of months: in your Sep 20 '18 comment, "because the output is counted from the columns defined in groupby - Col1 and Col3", do you mean Col1 and Col2, the columns you grouped by? Just want to understand more of the design of groupby. – Bowen Liu Sep 17 '20 at 19:03
  • So can I take it that the only circumstance where a `groupby` returns a DataFrame is when the aggregate function operates on columns (or a column) other than the columns it groups by? Could you correct my assumption if it's wrong, please? – Bowen Liu Sep 17 '20 at 19:08
  • @Bowen Liu, yes, you are right, it is `Col1` and `Col2`, not `Col3`. The idea of my comment is that `groupby.size()` counts the groups defined by the `by` or `level` parameter. It always returns a Series - with a single-level index if you group by one column, like `df.groupby(['Col1']).size()`, or a MultiIndex Series if you group by multiple columns, like `df.groupby(['Col1', 'Col2']).size()`. So it does not process columns selected after the `groupby`, like `Col3` in `df.groupby(['Col1', 'Col2'])['Col3'].sum()` (see the sketch after this thread). – jezrael Sep 17 '20 at 20:09
  • Thanks a lot. I realize my understanding of `groupby` is quite superficial, so I am trying to deduce some general rules from it. Are there any other aggregate functions, like `.size()`, that can generate a Series without specifying columns (in the form `df.groupby(['Col1']).function()`)? BTW many of your posts have proved immensely helpful to me. I wonder if you can share how you managed to gain such a deep and systematic understanding of pandas. – Bowen Liu Sep 17 '20 at 21:50
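
A small sketch of the behaviour discussed in this comment thread - `.size()` always returns a Series (with a MultiIndex when grouping by several columns), while other aggregations act on the remaining, non-grouped columns. Col3 here is a made-up extra column, added only for illustration:

import pandas as pd

df = pd.DataFrame({'Col1': [1, 1, 1, 1, 3, 3],
                   'Col2': [1, 2, 2, 2, 4, 4],
                   'Col3': [5, 6, 7, 8, 9, 10]})  # hypothetical extra column

# size() counts the rows in each group and always returns a Series,
# here with a (Col1, Col2) MultiIndex
counts = df.groupby(['Col1', 'Col2']).size()

# selecting one column before aggregating also returns a Series
col3_sums = df.groupby(['Col1', 'Col2'])['Col3'].sum()

# aggregating without selecting columns returns a DataFrame,
# with one column per remaining (non-grouped) column
all_sums = df.groupby(['Col1', 'Col2']).sum()
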

You could try

df.groupby(['Col1', 'Col2']).size()

For a different visual output compared to jez's answer, you can extend that solution with

pd.DataFrame(df.groupby(['Col1', 'Col2']).size().rename('Freq'))

which gives

           Freq
Col1 Col2      
1    1        1
     2        3
3    4        2
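
An equivalent spelling (not used in the original answer) converts the Series directly with Series.to_frame instead of the pd.DataFrame constructor plus rename:

df.groupby(['Col1', 'Col2']).size().to_frame('Freq')
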
Quickbeam2k1