4

I am trying to count the frequency of hashtag words in the 'text' column of my dataframe.

index        text
1            ello ello ello ello #hello #ello
2            red green blue black #colours
3            Season greetings #hello #goodbye 
4            morning #goodMorning #hello
5            my favourite animal #dog
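
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample frame above. It assumes the `index` column shown is the DataFrame index and that pandas is imported as `pd`; neither is stated in the original post.

```python
import pandas as pd

# Rebuild the sample data from the table above.
# Assumption: the "index" column is the DataFrame index, not a regular column.
df = pd.DataFrame(
    {
        "text": [
            "ello ello ello ello #hello #ello",
            "red green blue black #colours",
            "Season greetings #hello #goodbye",
            "morning #goodMorning #hello",
            "my favourite animal #dog",
        ]
    },
    index=pd.Index([1, 2, 3, 4, 5], name="index"),
)

print(df)
```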

word_freq = df.text.str.split(expand=True).stack().value_counts()

The above code performs a frequency count on every word in the text column, but I just want to return the hashtag frequencies.

For example, after running the code on my dataframe above, it should return:

#hello        3
#goodbye      1
#goodMorning  1
#ello         1
#colours      1
#dog          1

Is there a way of slightly re-jigging my word_freq code so it only counts hashtag words and returns them in the way I put above? Thanks in advance.

smci
  • Please include [`Minimal, Reproducible Example`](https://stackoverflow.com/a/20159305/4985099) – sushanth Aug 03 '20 at 11:18
  • Did you try filtering the words in each cell and keeping only those that start with `#`? – furas Aug 03 '20 at 14:40
  • Welcome to SO. The rules require you to show you tried to adapt the code yourself, and post a [Minimal, Complete, Verifiable Example](https://stackoverflow.com/help/minimal-reproducible-example). This has no MCVE. You can't just post a spec of what code you want written for you. – smci Aug 03 '20 at 15:37

3 Answers

2

Use Series.str.findall on column text to find all hashtag words then use Series.explode + Series.value_counts:

counts = df['text'].str.findall(r'(#\w+)').explode().value_counts()

Another idea using Series.str.split + DataFrame.stack:

s = df['text'].str.split(expand=True).stack()
counts = s[lambda x: x.str.startswith('#')].value_counts()

Result:

print(counts)
#hello          3
#dog            1
#colours        1
#ello           1
#goodMorning    1
#goodbye        1
Name: text, dtype: int64
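
A self-contained sketch of both approaches, run against a rebuilt copy of the question's sample data (the frame construction is my assumption about how `df` was created). The `na=False` argument is an addition not present in the one-liner above; it keeps the boolean mask valid if `stack` leaves missing values in place, which may depend on the pandas version.

```python
import pandas as pd

# Sample data copied from the question; how the asker built df is assumed.
df = pd.DataFrame(
    {
        "text": [
            "ello ello ello ello #hello #ello",
            "red green blue black #colours",
            "Season greetings #hello #goodbye",
            "morning #goodMorning #hello",
            "my favourite animal #dog",
        ]
    }
)

# Approach 1: collect every hashtag per row as a list, then flatten and count.
counts_findall = df["text"].str.findall(r"#\w+").explode().value_counts()

# Approach 2: split on whitespace, stack into one long Series,
# keep only tokens that start with '#', then count.
tokens = df["text"].str.split(expand=True).stack()
counts_split = tokens[tokens.str.startswith("#", na=False)].value_counts()

print(counts_findall)
print(counts_split)
```

The two results agree on this data. They can differ in general: `findall` picks up a hashtag glued to other characters (for example `word#tag`), while the split-based filter only keeps whitespace-separated tokens that begin with `#`.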
Glorfindel
Shubham Sharma
1

One way is to use str.extractall, which also strips the # from the result, and then apply value_counts:

s = df['text'].str.extractall(r'(?<=#)(\w*)')[0].value_counts()
print(s)
hello          3
colours        1
goodbye        1
ello           1
goodMorning    1
dog            1
Name: 0, dtype: int64
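
As a runnable sketch of the same idea, the variant below uses a plain capture group `#(\w+)` in place of the lookbehind. That swap is my own and is not part of the answer above: it gives the same counts on the sample data, but unlike `\w*` it will not record an empty string for a bare `#`. The frame construction is likewise assumed.

```python
import pandas as pd

# Sample data copied from the question; how the asker built df is assumed.
df = pd.DataFrame(
    {
        "text": [
            "ello ello ello ello #hello #ello",
            "red green blue black #colours",
            "Season greetings #hello #goodbye",
            "morning #goodMorning #hello",
            "my favourite animal #dog",
        ]
    }
)

# extractall returns one row per match and one column per capture group;
# column 0 holds the text captured after the '#'.
tags = df["text"].str.extractall(r"#(\w+)")[0]
counts = tags.value_counts()

print(counts)
```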
Ben.T
0

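A more verbose solution, but it does the trick. It is shown here on my own dataframe (data_100, with a TicketDescription column) rather than the one in the question.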

dictionary_count=data_100.TicketDescription.str.split(expand=True).stack().value_counts().to_dict()

dictionary_count={'accessgtgtjust': 1,
'sent': 1,
'investigate': 1,
'edit': 1,
'#prd': 1,
'getting': 1}

ert = [key for key in dictionary_count if '#' in key]

ert
Out[238]: ['#prd']

unwanted = set(dictionary_count.keys()) - set(ert)

for unwanted_key in unwanted:
    del dictionary_count[unwanted_key]

dictionary_count
Out[241]: {'#prd': 1}
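
The filter-then-delete steps above can also be sketched as a single dict comprehension. The dictionary literal is copied from the output shown earlier in this answer; the name `hashtag_counts` is hypothetical.

```python
# Word counts as shown above (copied from the displayed dictionary).
dictionary_count = {
    "accessgtgtjust": 1,
    "sent": 1,
    "investigate": 1,
    "edit": 1,
    "#prd": 1,
    "getting": 1,
}

# Keep only the entries whose key contains '#', matching the test used above.
hashtag_counts = {word: n for word, n in dictionary_count.items() if "#" in word}

print(hashtag_counts)  # {'#prd': 1}
```

This builds a new dictionary instead of deleting keys in place, so the original counts stay available. Use `word.startswith("#")` instead of `"#" in word` if only leading hashtags should count.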