Let's say my dataframe looks like this:

   column_name
1  book
2  fish
3  icecream|book
4  fish
5  campfire|book

Now, if I use df['column_name'].value_counts() it will tell me fish is the most frequent value.

However, I want book to be returned, since rows 1, 3 and 5 contain the word 'book'.

I know .value_counts() recognizes icecream|book as one value, but is there a way I can determine the most frequent value by counting the number of times each cell CONTAINS a certain value, so that 'book' will be the most frequent value?
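For reference, a minimal reproduction of the frame described above (the column name and index are taken from the question):

```python
import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame(
    {'column_name': ['book', 'fish', 'icecream|book', 'fish', 'campfire|book']},
    index=range(1, 6),
)

# whole-cell counting: 'fish' appears twice as a full cell, so it wins
print(df['column_name'].value_counts().idxmax())  # fish
```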

Boomer
  • Technically, as you've stated your question, it should return either "o", since that's the most frequent value, or "i", since that's the value that the most rows contain. If you want Python to count word frequencies, you have to tell it what a "word" is, i.e. what the separator is. – Acccumulation Jun 25 '18 at 16:08

3 Answers


Use str.split with expand=True, then stack, and call value_counts on the resulting Series:

a = df['column_name'].str.split('|', expand=True).stack().value_counts()
print(a)
book        3
fish        2
icecream    1
campfire    1
dtype: int64

Or use Counter with a flattening list comprehension:

from collections import Counter

import pandas as pd

a = pd.Series(Counter([y for x in df['column_name'] for y in x.split('|')]))
print(a)
book        3
fish        2
icecream    1
campfire    1
dtype: int64
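Either way, the result is a counts Series sorted in descending order, so the most frequent token itself can be read off with idxmax (a sketch assuming the example frame from the question):

```python
import pandas as pd

df = pd.DataFrame(
    {'column_name': ['book', 'fish', 'icecream|book', 'fish', 'campfire|book']}
)

# split each cell on '|', stack into one long Series, count tokens
a = df['column_name'].str.split('|', expand=True).stack().value_counts()
print(a.idxmax())  # book
```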
jezrael
  • How about a changing separator? For example: `0::icecream||1::campfire||2::fish` instead of just `icecream|campfire`? – Boomer Jun 25 '18 at 16:38
  • @Boomer Then use [this](https://stackoverflow.com/q/44000278/2901002) solution with findall instead of split. – jezrael Jun 25 '18 at 16:50
  • @Boomer - this works for me: `pd.Series(Counter([y for x in df['column_name'].str.findall(r'(?<=::)\w+') for y in x]))` – jezrael Jun 25 '18 at 16:57
  • Thanks for helping me out! Having such an awesome community makes programming even more fun :) – Boomer Jun 25 '18 at 17:07

pd.value_counts

You can also pass a list to pd.value_counts. Note that I join on | and then split on | again, flattening the whole column into one list of tokens.

pd.value_counts('|'.join(df.column_name).split('|'))

book        3
fish        2
icecream    1
campfire    1
dtype: int64

get_dummies

This works because str.get_dummies splits on | by default, which matches how your data is structured. If you had a different separator, pass it to the get_dummies call explicitly: df.column_name.str.get_dummies(sep='|').sum()

df.column_name.str.get_dummies().sum()

book        3
campfire    1
fish        2
icecream    1
dtype: int64

If you want the results sorted:

df.column_name.str.get_dummies().sum().sort_values(ascending=False)

book        3
fish        2
icecream    1
campfire    1
dtype: int64

pd.factorize and np.bincount

Note that I join the entire column and split again.

import numpy as np

f, u = pd.factorize('|'.join(df.column_name).split('|'))
pd.Series(np.bincount(f), u)

book        3
fish        2
icecream    1
campfire    1
dtype: int64

To sort, we can use sort_values as we did above, or sort the bincount array directly:

f, u = pd.factorize('|'.join(df.column_name).split('|'))
counts = np.bincount(f)
a = counts.argsort()[::-1]
pd.Series(counts[a], u[a])

book        3
fish        2
campfire    1
icecream    1
dtype: int64
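If only the winning token is needed, the bincount can be reduced with argmax instead of building a sorted Series (a sketch using the example frame from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'column_name': ['book', 'fish', 'icecream|book', 'fish', 'campfire|book']}
)

# f: integer codes per token, u: the unique tokens
f, u = pd.factorize('|'.join(df.column_name).split('|'))

# the token whose code occurs most often
top = u[np.bincount(f).argmax()]
print(top)  # book
```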
piRSquared

Using collections.Counter + itertools.chain:

from collections import Counter
from itertools import chain

c = Counter(chain.from_iterable(df['column_name'].str.split('|')))

res = pd.Series(c)

print(res)

book        3
campfire    1
fish        2
icecream    1
dtype: int64
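A Counter also sorts for you: if you only need the top value, most_common skips the Series conversion entirely (assuming the example frame from the question):

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame(
    {'column_name': ['book', 'fish', 'icecream|book', 'fish', 'campfire|book']}
)

# flatten the split cells into one stream of tokens and count them
c = Counter(chain.from_iterable(df['column_name'].str.split('|')))
print(c.most_common(1))  # [('book', 3)]
```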
jpp