Let's say my dataframe looks like this:

   column_name
1  book
2  fish
3  icecream|book
4  fish
5  campfire|book

Now, if I use df['column_name'].value_counts() it will tell me fish is the most frequent value.

However, I want book to be returned, since rows 1, 3 and 5 contain the word 'book'.

I know .value_counts() recognizes icecream|book as one value, but is there a way I can determine the most frequent value by counting the number of times each cell CONTAINS a certain value, so that 'book' will be the most frequent value?
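For reference, a minimal reproduction of the frame described above (the column name and index are taken from the question):

```python
import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame(
    {'column_name': ['book', 'fish', 'icecream|book', 'fish', 'campfire|book']},
    index=range(1, 6),
)

# whole-cell counting: 'fish' appears twice as a full cell, so it wins
print(df['column_name'].value_counts().idxmax())  # fish
```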

Boomer
  • Technically, as you've stated your question, it should return either "o", since that's the most frequent value, or "i", since that's the value that the most rows contain. If you want Python to count word frequencies, you have to tell it what a "word" is, i.e. what the separator is. – Acccumulation Jun 25 '18 at 16:08

3 Answers


Use str.split with expand=True, then stack, and call value_counts on the resulting Series:

a = df['column_name'].str.split('|', expand=True).stack().value_counts()
print(a)
book        3
fish        2
icecream    1
campfire    1
dtype: int64

Or use Counter with a flattening list comprehension:

from collections import Counter

import pandas as pd

a = pd.Series(Counter([y for x in df['column_name'] for y in x.split('|')]))
print(a)
book        3
fish        2
icecream    1
campfire    1
dtype: int64
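Either way, the result is a counts Series sorted in descending order, so the most frequent token itself can be read off with idxmax (a sketch assuming the example frame from the question):

```python
import pandas as pd

df = pd.DataFrame(
    {'column_name': ['book', 'fish', 'icecream|book', 'fish', 'campfire|book']}
)

# split each cell on '|', stack into one long Series, count tokens
a = df['column_name'].str.split('|', expand=True).stack().value_counts()
print(a.idxmax())  # book
```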
jezrael
  • How about a changing separator? For example: `0::icecream||1::campfire||2::fish` instead of just `icecream|campfire`? – Boomer Jun 25 '18 at 16:38
  • @Boomer Then use [this](https://stackoverflow.com/q/44000278/2901002) solution with findall instead of split. – jezrael Jun 25 '18 at 16:50
  • @Boomer - this works for me: `pd.Series(Counter([y for x in df['column_name'].str.findall(r'(?<=::)\w+') for y in x]))` – jezrael Jun 25 '18 at 16:57
  • Thanks for helping me out! Having such an awesome community makes programming even more fun :) – Boomer Jun 25 '18 at 17:07

pd.value_counts

You can also pass a list to pd.value_counts. Note that I join on | and then split on | again, flattening the whole column into one list of tokens.

pd.value_counts('|'.join(df.column_name).split('|'))

book        3
fish        2
icecream    1
campfire    1
dtype: int64

get_dummies

This works because str.get_dummies splits on | by default, which matches how your data is structured. If you had a different separator, pass it to the get_dummies call explicitly: df.column_name.str.get_dummies(sep='|').sum()

df.column_name.str.get_dummies().sum()

book        3
campfire    1
fish        2
icecream    1
dtype: int64

If you want the results sorted:

df.column_name.str.get_dummies().sum().sort_values(ascending=False)

book        3
fish        2
icecream    1
campfire    1
dtype: int64

pd.factorize and np.bincount

Note that I join the entire column and split again.

import numpy as np

f, u = pd.factorize('|'.join(df.column_name).split('|'))
pd.Series(np.bincount(f), u)

book        3
fish        2
icecream    1
campfire    1
dtype: int64

To sort, we can use sort_values as we did above, or sort the bincount array directly:

f, u = pd.factorize('|'.join(df.column_name).split('|'))
counts = np.bincount(f)
a = counts.argsort()[::-1]
pd.Series(counts[a], u[a])

book        3
fish        2
campfire    1
icecream    1
dtype: int64
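If only the winning token is needed, the bincount can be reduced with argmax instead of building a sorted Series (a sketch using the example frame from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'column_name': ['book', 'fish', 'icecream|book', 'fish', 'campfire|book']}
)

# f: integer codes per token, u: the unique tokens
f, u = pd.factorize('|'.join(df.column_name).split('|'))

# the token whose code occurs most often
top = u[np.bincount(f).argmax()]
print(top)  # book
```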
piRSquared

Using collections.Counter + itertools.chain:

from collections import Counter
from itertools import chain

c = Counter(chain.from_iterable(df['column_name'].str.split('|')))

res = pd.Series(c)

print(res)

book        3
campfire    1
fish        2
icecream    1
dtype: int64
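A Counter also sorts for you: if you only need the top value, most_common skips the Series conversion entirely (assuming the example frame from the question):

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame(
    {'column_name': ['book', 'fish', 'icecream|book', 'fish', 'campfire|book']}
)

# flatten the split cells into one stream of tokens and count them
c = Counter(chain.from_iterable(df['column_name'].str.split('|')))
print(c.most_common(1))  # [('book', 3)]
```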
jpp