Here is my attempt at a word count for a single column using groupby in pandas:

First, set up the data:

import numpy as np
import pandas as pd

columns = ['col1', 'col2', 'col3']
data = np.array([['word1', 'word2', 'word3'], ['word1', 'word5', 'word3'], ['word3', 'word7', 'word3']])
to_count = pd.DataFrame(data, columns=columns)

I'm attempting to count the words in col1 of to_count.

to_count contains:

    col1   col2   col3
0  word1  word2  word3
1  word1  word5  word3
2  word3  word7  word3

To count the words, I then use:

print(to_count.groupby('col1').count())

which displays:

       col2  col3
col1             
word1     2     2
word3     1     1

This seems partly correct in that the word counts are returned, but they are repeated across every remaining column. How do I get the word count for a single column? I could just select one column from the resulting count DataFrame, but that does not seem like the right approach.

blue-sky

3 Answers

If I understand you correctly, I think this is what you're looking for:

print(to_count.groupby('col1')['col1'].count())

Output:

col1
word1    2
word3    1
Name: col1, dtype: int64
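
As a related option, here is a minimal sketch using `size()` instead of `count()`. `size()` counts rows per group and always returns a single Series, while `count()` counts non-null values in each column. The frame below just rebuilds the question's `to_count` so the snippet runs on its own; the name `word_counts` is only for illustration.

```python
import pandas as pd

# Rebuild the question's example frame
to_count = pd.DataFrame(
    [['word1', 'word2', 'word3'],
     ['word1', 'word5', 'word3'],
     ['word3', 'word7', 'word3']],
    columns=['col1', 'col2', 'col3'],
)

# size() counts rows per group, so the result is one Series
# indexed by the distinct words in col1
word_counts = to_count.groupby('col1').size()
print(word_counts)

# Individual counts can be looked up by word
print(word_counts['word1'])
```

Note that `size()` includes rows whose other columns are missing, so the two methods can differ when the data contains NaN values.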
Joe T. Boka

You can apply the value_counts() method to a single column of the dataframe. The following applies it to all columns, one by one:

for onecol in to_count:
    print(onecol, ":\n", to_count[onecol].value_counts())

Output:

col1 :
word1    2
word3    1
Name: col1, dtype: int64
col2 :
word5    1
word2    1
word7    1
Name: col2, dtype: int64
col3 :
word3    3
Name: col3, dtype: int64
rnso

How about this:

Single column:

df['col1'].value_counts()

will return:

word1    2
word3    1

All columns:

df.apply(lambda col: col.value_counts()).fillna(0).astype(int)

will return:

       col1  col2  col3
word1     2     0     0
word2     0     1     0
word3     1     0     3
word5     0     1     0
word7     0     1     0
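
To see why the `fillna(0).astype(int)` step is there, here is a rough sketch of the intermediate result. It assumes the same three-column frame as in the question, built directly rather than parsed from text; the names `raw` and `counts` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['word1', 'word1', 'word3'],
    'col2': ['word2', 'word5', 'word7'],
    'col3': ['word3', 'word3', 'word3'],
})

# Each column's value_counts() is aligned on the union of all words.
# A word that never occurs in a column shows up as NaN there,
# which also turns the counts into floats.
raw = df.apply(lambda col: col.value_counts())
print(raw)

# Replace the missing entries with 0 and restore integer counts
counts = raw.fillna(0).astype(int)
print(counts)
```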

Copy & paste example:

from io import StringIO
import pandas as pd

data = """
    col1   col2   col3
0  word1  word2  word3
1  word1  word5  word3
2  word3  word7  word3
"""

# Raw string avoids an invalid-escape warning for '\s'
df = pd.read_table(StringIO(data), sep=r'\s+')

print(df['col1'].value_counts())
print(df.apply(lambda col: col.value_counts()).fillna(0).astype(int))
Stefan Falk