Count hashtag frequency in a dataframe

Question

I am trying to count the frequency of hashtag words in the 'text' column of my dataframe.

index        text
1            ello ello ello ello #hello #ello
2            red green blue black #colours
3            Season greetings #hello #goodbye 
4            morning #goodMorning #hello
5            my favourite animal #dog

word_freq = df.text.str.split(expand=True).stack().value_counts()

The above code performs a frequency count on every word in the text column, but I just want to return the hashtag frequencies.

For example after running the code on my dataframe above, it should return

#hello        3
#goodbye      1
#goodMorning  1
#ello         1
#colours      1
#dog          1

Is there a way of slightly re-jigging my word_freq code so it only counts hashtag words and returns them in the way I put above? Thanks in advance.

Please include [`Minimal, Reproducible Example`](https://stackoverflow.com/a/20159305/4985099) — sushanth, Aug 03 '20 at 11:18
did you try to filter words in cells and keep only words which starts with `#` ? — furas, Aug 03 '20 at 14:40
Welcome to SO. The rules require you to show you tried to adapt the code yourself, and post a [Minimal, Complete, Verifiable Example](https://stackoverflow.com/help/minimal-reproducible-example). This has no MCVE. You can't just post a spec of what code you want written for you. — smci, Aug 03 '20 at 15:37

score 2 · Accepted Answer · edited Oct 10 '22 at 08:02

Use Series.str.findall on column text to find all hashtag words then use Series.explode + Series.value_counts:

counts = df['text'].str.findall(r'(#\w+)').explode().value_counts()

Another idea using Series.str.split + DataFrame.stack:

s = df['text'].str.split(expand=True).stack()
counts = s[lambda x: x.str.startswith('#')].value_counts()

Result:

print(counts)
#hello          3
#dog            1
#colours        1
#ello           1
#goodMorning    1
#goodbye        1
Name: text, dtype: int64
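
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the dataframe from the question and applies both approaches. The variable names `counts_findall`, `tokens` and `counts_split` are only illustrative, and the `dropna()` / `na=False` guards are an assumption added so the split variant behaves the same across pandas versions.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "text": [
            "ello ello ello ello #hello #ello",
            "red green blue black #colours",
            "Season greetings #hello #goodbye ",
            "morning #goodMorning #hello",
            "my favourite animal #dog",
        ]
    },
    index=[1, 2, 3, 4, 5],
)

# Approach 1: findall gives a list of hashtags per row,
# explode turns each list element into its own row.
counts_findall = df["text"].str.findall(r"#\w+").explode().value_counts()

# Approach 2: split on whitespace, stack into one long Series,
# then keep only the tokens that start with '#'.
tokens = df["text"].str.split(expand=True).stack().dropna()
counts_split = tokens[tokens.str.startswith("#", na=False)].value_counts()

print(counts_findall.to_dict())
print(counts_split.to_dict())
```

The two approaches agree on this data, but they can differ on real text: the regex stops at the first non-word character, so `#hello,` is counted as `#hello`, while the split version keeps the trailing comma as part of the token.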

score 1 · Answer 2 · answered Aug 03 '20 at 11:58

One way is to use str.extractall with a lookbehind, which removes the # from the result, and then call value_counts:

s = df['text'].str.extractall(r'(?<=#)(\w+)')[0].value_counts()
print(s)
hello          3
colours        1
goodbye        1
ello           1
goodMorning    1
dog            1
Name: 0, dtype: int64
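
A possible variant, not the pattern used in this answer: since extractall only returns what is inside the capture group, the `#` can simply be left outside the group instead of using a lookbehind. The sketch below assumes the dataframe from the question; `matches` and `tag_counts` are illustrative names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "text": [
            "ello ello ello ello #hello #ello",
            "red green blue black #colours",
            "Season greetings #hello #goodbye ",
            "morning #goodMorning #hello",
            "my favourite animal #dog",
        ]
    }
)

# extractall returns one row per match and one column per capture group.
# The '#' sits outside the group, so column 0 holds the bare tag names.
matches = df["text"].str.extractall(r"#(\w+)")
tag_counts = matches[0].value_counts()

print(tag_counts.to_dict())
```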

score 0 · Answer 3 · answered Aug 03 '20 at 12:48

A slightly more verbose solution, but it does the trick.

dictionary_count=data_100.TicketDescription.str.split(expand=True).stack().value_counts().to_dict()

dictionary_count={'accessgtgtjust': 1,
'sent': 1,
'investigate': 1,
'edit': 1,
'#prd': 1,
'getting': 1}

ert=[i for i in list(dictionary_count.keys()) if '#' in i]

ert
Out[238]: ['#prd']

unwanted = set(dictionary_count.keys()) - set(ert)

for unwanted_key in unwanted:
    del dictionary_count[unwanted_key]

dictionary_count
Out[241]: {'#prd': 1}
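
The same filtering can probably be written more compactly with a dict comprehension. This is a sketch on the question's dataframe rather than the `data_100` frame above, and it assumes a hashtag is a token that starts with `#` (the code above keeps any token that merely contains `#`). The names `all_counts` and `hashtag_counts` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "text": [
            "ello ello ello ello #hello #ello",
            "red green blue black #colours",
            "Season greetings #hello #goodbye ",
            "morning #goodMorning #hello",
            "my favourite animal #dog",
        ]
    }
)

# Count every whitespace-separated token, as in the question's word_freq.
all_counts = df["text"].str.split().explode().value_counts().to_dict()

# Keep only the keys that look like hashtags (assumption: they start with '#').
hashtag_counts = {word: n for word, n in all_counts.items() if word.startswith("#")}

print(hashtag_counts)
```

Building the filtered dict directly avoids deleting keys from the original one, so `all_counts` stays available if the full word counts are needed later.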
