84

I want to create a count of unique values from one of my Pandas dataframe columns and then add a new column with those counts to my original dataframe. I've tried a couple of different things: I created a pandas Series and then calculated the counts with the value_counts method. I tried to merge these values back into my original dataframe, but the keys that I want to merge on are in the index (ix/loc).

Color Value
Red   100
Red   150
Blue  50

I want to return something like:

Color Value Counts
Red   100   2
Red   150   2 
Blue  50    1
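
A minimal sketch of the value_counts-plus-merge attempt described above, using the example data (the reset_index step is the workaround for the keys being in the index):

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'], 'Value': [100, 150, 50]})

# value_counts puts the keys ('Red', 'Blue') in the index of the result
counts = df['Color'].value_counts()

# merge needs the keys as a regular column, so reset the index first
counts = counts.reset_index()
counts.columns = ['Color', 'Counts']

df = df.merge(counts, on='Color')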
user2592989
    This is popular question lately. See this question [here](http://stackoverflow.com/questions/17666075/python-pandas-groupby-result/17666287#17666287) which is almost identical to your situation. – bdiamante Jul 17 '13 at 20:19

8 Answers

85
df['Counts'] = df.groupby(['Color'])['Value'].transform('count')

For example,

In [102]: df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]})

In [103]: df
Out[103]: 
  Color  Value
0   Red    100
1   Red    150
2  Blue     50

In [104]: df['Counts'] = df.groupby(['Color'])['Value'].transform('count')

In [105]: df
Out[105]: 
  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1

Note that transform('count') ignores NaNs. If you want to count NaNs, use transform(len).
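
For example, here is a quick sketch of the difference, assuming a df with a missing Value:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'],
                   'Value': [100, np.nan, 50]})

# 'count' skips the NaN: both Red rows get 1
df['Counts'] = df.groupby(['Color'])['Value'].transform('count')

# len counts rows regardless of NaNs: both Red rows get 2
df['Counts_len'] = df.groupby(['Color'])['Value'].transform(len)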


To the anonymous editor: If you are getting an error while using transform('count') it may be due to your version of Pandas being too old. The above works with pandas version 0.15 or newer.

unutbu
  • Thanks a lot. Very helpful. I've been trying to apply that to a larger DataFrame and keep on getting this error "ValueError: Wrong number of items passed 1, indices imply 4". – user2592989 Jul 17 '13 at 20:47
  • Try selecting only one column for transform, i.e. df.groupby(['Color'])['Value'].transform('count') – user1827356 Jul 17 '13 at 21:17
  • added to the cookbook : http://pandas.pydata.org/pandas-docs/dev/cookbook.html#grouping (docs will build tomorrow) – Jeff Jul 17 '13 at 21:47
  • Not sure this is the best way to do this, but df['new column name'] = df[['col1','col2']].groupby('col1').transform('count') seemed to fix the problem I had with passing the wrong number of items. – user2592989 Jul 18 '13 at 14:01
  • thanks @user2592989. I don't understand why, but if you try to do this same thing while also grouping on the `Value` column (nvm this being a poor example), I get `ValueError: Wrong number of items passed 1, indices imply 2`. It is not clear why, but it happens with `df['Counts'] = df.groupby(['Value', 'Color']).transform('count')`. – Steven C. Howell May 15 '15 at 01:16
  • I am using this on a DataFrame with several columns and it seems I have to enter every column label. Is this the correct implementation of this method? It seems inefficient/tedious. – Steven C. Howell May 15 '15 at 01:27
  • You can just do the count on one column (eg `df.groupby(['Color'])['certain_col'].transform('count')`). Or use `size` instead of count (but this will count NaN values as well). If you still have problems with it, please open a new question. – joris May 15 '15 at 08:06
  • Just for information: groupby drops null values, so if you transform on a column which has nulls, the number of returned values will be lower than expected. You have to ensure that the column you group on doesn't have any nulls. This can be checked with: df = df[pd.notnull(df['col'])] – Romain Jouin May 18 '15 at 22:14
  • I tried as well and I got the error: ValueError: Wrong number of items passed 1, indices imply 2. However, after slicing my df and leaving only 2 columns (as in the answer) it worked. Any idea why and how to perform this with more than 2 columns? – Blue Moon Aug 04 '15 at 12:22
  • @johnred: Specify the column you wish to count before calling `transform`: `df.groupby(['Color'])['Value'].transform('count')`. See [DataFrame column selection in GroupBy](http://pandas.pydata.org/pandas-docs/stable/groupby.html#dataframe-column-selection-in-groupby). – unutbu Aug 04 '15 at 12:58
  • Why is the cost of `transform(len)` so much higher than `transform('count')` or `transform('size')`? For me, they cost 7 min and 5 s, respectively. – YANGSU LIU May 19 '21 at 15:03
  • The requested calculation should depend only on the column `'Color'`. However you are making a reference to `'Value'`. Is there an elegant way to do this without referencing extra columns? – hipoglucido Oct 24 '21 at 10:12
21

One other option:

z = df['Color'].value_counts()

z1 = z.to_dict() #converts to dictionary

df['Count_Column'] = df['Color'].map(z1) 

This option will give you a column with repeated values of the counts, corresponding to the frequency of each value in the 'Color' column.
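
For example, on the question's data this produces:

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'], 'Value': [100, 150, 50]})

z = df['Color'].value_counts()   # Red -> 2, Blue -> 1
z1 = z.to_dict()                 # {'Red': 2, 'Blue': 1}
df['Count_Column'] = df['Color'].map(z1)

print(df)
#   Color  Value  Count_Column
# 0   Red    100             2
# 1   Red    150             2
# 2  Blue     50             1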

ZakS
  • This can be simplified to: `df['Count_Column'] = df['Color'].map(df['Color'].value_counts())`. You can use a series to map (doesn't have to be a dict) – sacuL Jul 13 '18 at 20:15
13

This answer uses Series.map with Series.value_counts. It was tested with Pandas 1.1.

df['counts'] = df['attribute'].map(df['attribute'].value_counts())

Credit: comment by sacuL

Asclepius
5

df['Counts'] = df.Color.groupby(df.Color).transform('count')

You can do this with any series: group it by itself and call transform('count'):

>>> series = pd.Series(['Red', 'Red', 'Blue'])
>>> series.groupby(series).transform('count')
0    2
1    2
2    1
dtype: int64
1''
3

My initial thought was to use a list comprehension, as shown below, but as was pointed out in the comment, this is slower than the groupby-and-transform method. I will leave this answer to demonstrate WHAT NOT TO DO:

In [94]: df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]})
In [95]: df['Counts'] = [sum(df['Color'] == df['Color'][i]) for i in xrange(len(df))]
In [96]: df
Out[96]: 
  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1

[3 rows x 3 columns]

@unutbu's method gets complicated for DataFrames with several columns, which makes this one simpler to code. If you are working with a small data frame, this is faster (see below), but otherwise, you should NOT use it.

In [97]: %timeit df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]}); df['Counts'] = df.groupby(['Color']).transform('count')
100 loops, best of 3: 2.87 ms per loop
In [98]: %timeit df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]}); df['Counts'] = [sum(df['Color'] == df['Color'][i]) for i in xrange(len(df))]
1000 loops, best of 3: 1.03 ms per loop
Steven C. Howell
  • 16,902
  • 15
  • 72
  • 97
  • The example with 3 rows is very misleading for timing. Try it with a larger dataframe, and you will see that the groupby approach is *much* faster (I tried it with your `df` repeated 1000 times (`df = pd.concat([df]*1000, ignore_index=True)`) and got 3.6 ms (groupby) vs 29 s (list comprehension)). Further, I think the groupby approach is simpler. – joris May 15 '15 at 08:41
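
A sketch of how that comparison could be reproduced on Python 3 (timings vary by machine and pandas version; the pd.concat setup is taken from the comment above):

import pandas as pd
from timeit import timeit

df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]})
df = pd.concat([df] * 1000, ignore_index=True)  # 3000 rows

# groupby/transform: a single vectorized pass over the data
t_groupby = timeit(
    lambda: df.groupby('Color')['Value'].transform('count'), number=10)

# list comprehension: one full-column comparison per row (quadratic)
t_listcomp = timeit(
    lambda: [sum(df['Color'] == df['Color'][i]) for i in range(len(df))],
    number=1)

print(f'groupby:   {t_groupby / 10:.4f} s per run')
print(f'listcomp:  {t_listcomp:.4f} s per run')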
0

Create a column containing the count of repeated values, where the values being counted are temporary calculations computed from other columns. Very fast. Credit to @ZakS.

sum_A_B = df['A']+df['B']
sum_A_B_dict = sum_A_B.value_counts().to_dict()
df['sum_A_B'] = sum_A_B.map(sum_A_B_dict) 
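
For example, with a small frame containing the assumed columns A and B:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [10, 10, 20]})

sum_A_B = df['A'] + df['B']                      # 11, 11, 22
sum_A_B_dict = sum_A_B.value_counts().to_dict()  # {11: 2, 22: 1}
df['sum_A_B'] = sum_A_B.map(sum_A_B_dict)

print(df)
#    A   B  sum_A_B
# 0  1  10        2
# 1  1  10        2
# 2  2  20        1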
BSalita
0

While there are already plenty of great answers here, and I personally believe using

(given a dataframe df)

df['new_value_col'] = df.groupby('colname_to_count')['colname_to_count'].transform('count')

is one of the best and most straightforward options, I wanted to provide another method that I have used successfully.

import pandas as pd
import numpy as np

df['new_value_col'] = df.apply(lambda row: np.sum(df['col_to_count'] == row['col_to_count']), axis=1)

Here we compare the column we want to count against each row's value within the lambda expression, and then use np.sum to count the occurrences of that value in the series.
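
Applied to the question's data (note this recomputes a full-column comparison for every row, so it is quadratic and best kept for small frames):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'], 'Value': [100, 150, 50]})

# for each row, count how many rows share its Color value
df['Counts'] = df.apply(lambda row: np.sum(df['Color'] == row['Color']), axis=1)

print(df)
#   Color  Value  Counts
# 0   Red    100       2
# 1   Red    150       2
# 2  Blue     50       1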

Thought this could be useful, never bad to have multiple options!

Jeff W
0

This uses the nunique method along with dropna=True to exclude NaN values. Tested in Google Colab.

df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'], 'Value': [100, 150, 50]})
total_counts = df.groupby('Color')['Value'].nunique(dropna=True)
df['Counts'] = df['Color'].transform(lambda x: total_counts[x])
print(df)

Output:

  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1

Note that nunique counts distinct Value entries per Color, so it matches the plain row counts here only because the two Red values differ.

Nimra Tahir