How to merge pandas value_counts() to dataframe or use it to subset a dataframe

Question

I used pandas df.value_counts() to find the number of occurrences of particular brands. I want to merge those value counts with the respective brands in the initial dataframe.

 df has many columns including one named 'brands'
 brands = df.brands.value_counts()

 brand1   143
 brand2   21
 brand3   101
 etc.

How do I merge the value counts with the original dataframe such that each brand's corresponding count is in a new column, say "brand_count"?

Is it possible to assign headers to these columns; the names function won't work with series and I was unable to convert it to a dataframe to possibly merge the data that way. But, value_counts outputs a Series of dtype int64 (brand names should be type string) which means I cannot do the following:

 df2 = pd.DataFrame({'brands': list(brands_all[0]), "brand_count":
 list(brands_all[1])})
 (merge with df)

Ultimately, I want to obtain this:

 col1  col2  col3  brands  brand_count ... col150
                   A        30
                   C        140
                   A        30
                   B        111

score 15 · Answer 1 · answered Mar 05 '16 at 11:02

is that what you want:

import numpy as np
import pandas as pd

# generating random DataFrame
brands_list = ['brand{}'.format(i) for i in range(10)]
a = pd.DataFrame({'brands': np.random.choice(brands_list, 100)})
b = pd.DataFrame(np.random.randint(0,10,size=(100, 3)), columns=list('ABC'))
df = pd.concat([a, b], axis=1)
print(df.head())

# generate 'brands' DF
brands = pd.DataFrame(df.brands.value_counts().reset_index())
brands.columns = ['brands', 'count']
print(brands)

# merge 'df' & 'brands_count'
merged = pd.merge(df, brands, on='brands')
print(merged)

PS first big part is just a dataframe generation.

The part which is interesting for you starts with the # generate 'brands' DF comment

score 13 · Answer 2 · answered Apr 08 '19 at 15:24

13

i think the best way is to use map

df['brand_count']= df.brand.map(df.brand.value_counts())

this is so much faster than groupby method for example (factor 500 on a 15000 row df) and take only one line

answered Apr 08 '19 at 15:24

Egos

166
1
7

1

Super elegant and efficient solution! It also taught me that map() can work with a series too! Thanks :) – Kon Pal May 16 '19 at 15:18

score 9 · Answer 3 · answered Mar 05 '16 at 02:52

You want to use transform.

import numpy as np
import pandas as pd

np.random.seed(0)

# Create dummy data.
df = pd.DataFrame({'brands': ['brand{0}'.format(n) 
                   for n in np.random.random_integers(0, 5, 10)]})

df['brand_count'] = \
    df.groupby('brands', as_index=False)['brands'].transform(lambda s: s.count())

>>> df
   brands brand_count
0  brand4           1
1  brand5           2
2  brand0           1
3  brand3           4
4  brand3           4
5  brand3           4
6  brand1           1
7  brand3           4
8  brand5           2
9  brand2           1

For reference:

>>> df.brands.value_counts()
brand3    4
brand5    2
brand4    1
brand0    1
brand1    1
brand2    1
Name: brands, dtype: int64

This gives you the column with brands and their counts, brand_count, but I'd like to add the brand_count column to a df with many other columns. When I try this I lose the remaining columns — user2476665, Mar 05 '16 at 03:04
Are the source and target the same dataframe? If so, `df['brand_count'] = ...` will work. — Alexander, Mar 05 '16 at 03:07

score 1 · Answer 4 · answered Mar 26 '18 at 13:34

df = ...
key_col = "brand"
count_col = "brand_count"

result = (
    df.join(
        df[key_col].value_counts().rename(count_col), 
        how="left", 
        on=key_col)
)

If you need to join the counts to a different dataframe remember to fill NaNs with zeros:

df = ...
other = ...
key_col = "brand"
count_col = "brand_count"

result = (
    other.join(
        df[key_col].value_counts().rename(count_col), 
        how="left", 
        on=key_col)
    .fillna({count_col: 0})
)

score 0 · Answer 5 · answered Sep 12 '18 at 11:36

Pandas DataFrame's merge and value_counts attributes are pretty fast, so I would combine the two.

df.merge(df['brand'].value_counts().to_frame(), how='left', left_on='brand',
         right_index=True, suffixes=('', 'x'))\
  .rename(columns={'brandx':'brand_count'})

How to merge pandas value_counts() to dataframe or use it to subset a dataframe

5 Answers5

Linked