How to find the count of a word in a string

Question

I have a string "Hello I am going to I with hello am". I want to find how many times a word occurs in the string. For example, hello occurs 2 times. I tried this approach, but it only prints characters:

def countWord(input_string):
    d = {}
    for word in input_string:
        try:
            d[word] += 1
        except:
            d[word] = 1

    for k in d.keys():
        print "%s: %d" % (k, d[k])
print countWord("Hello I am going to I with Hello am")

I want to learn how to find the word count.

Depending on your use case, there's one more thing you might need to consider: some words change meaning depending on their capitalization, like `Polish` and `polish`. That probably won't matter for you, but it's worth remembering. – DSM, Jul 02 '12 at 20:28
Could you define your data set more for us? Will you worry about punctuation such as in `I'll`, `don't`, etc.? Some of these are raised in the comments below. And what about differences in case? – Levon, Jul 02 '12 at 20:38

score 43 · Accepted Answer · answered Jul 02 '12 at 20:05

If you want to find the count of an individual word, just use count:

input_string.count("Hello")

Use collections.Counter and split() to tally up all the words:

from collections import Counter

words = input_string.split()
wordCount = Counter(words)
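
For illustration, here is a minimal, self-contained sketch of the Counter approach applied to the string from the question. The variable names are placeholders only. Note that split() separates on whitespace alone, so case and punctuation are left untouched.

```python
from collections import Counter

input_string = "Hello I am going to I with hello am"

# split() with no arguments breaks the string on runs of whitespace
words = input_string.split()
word_count = Counter(words)

# Counter behaves like a dict; a word that never occurs has a count of 0
print(word_count["I"])        # "I" appears twice
print(word_count["Hello"])    # case-sensitive: "hello" is counted separately
print(word_count["missing"])  # absent words give 0 rather than a KeyError
```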

answered Jul 02 '12 at 20:05

Joel Cornett


Is the collections module part of the basic Python installation? – Varun Jul 02 '12 at 20:13
I'm copying part of a comment by @DSM left for me since I also used `str.count()` as my initial solution - this has a problem since `"am ham".count("am")` will yield 2 rather than 1 – Levon Jul 02 '12 at 20:35
@Varun: I believe `collections` is in Python 2.4 and above. – Joel Cornett Jul 02 '12 at 23:32
@Levon: You're absolutely right. I believe using Counter, along with a regex word collector is probably the best option. Will edit answer accordingly. – Joel Cornett Jul 02 '12 at 23:33
Well .. credit goes to @DSM who made me aware of this in the first place (since I was using `str.count()` too) – Levon Jul 02 '12 at 23:37
why not just use len() instead of count()? words = input_string.split() ... wordCount = len(words) – Bimo Jun 27 '17 at 19:09

score 6 · Answer 2 · answered Jul 02 '12 at 20:05

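from collections import Counter
import re

def countWords(text):
    # Lowercase, then treat runs of word characters and apostrophes as words
    return Counter(re.findall(r"[\w']+", text.lower()))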

Using re.findall is more versatile than split: it drops punctuation such as commas and periods, and including the apostrophe in the character class keeps contractions such as "don't" and "I'll" intact.

Demo (using your example):

>>> countWords("Hello I am going to I with hello am")
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

If you expect to be making many of these queries, this will only do O(N) work once, rather than O(N*#queries) work.
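
To make the cost argument concrete, the sketch below builds the Counter once and then answers several lookups from it. The function name count_words, the sample sentence, and the query list are assumptions for illustration only.

```python
import re
from collections import Counter

def count_words(text):
    # One O(N) pass over the text: lowercase it, then collect word tokens
    return Counter(re.findall(r"[\w']+", text.lower()))

counts = count_words("Hello I am going to I with hello am, don't I?")

# Each lookup is now a dictionary access instead of a rescan of the text
queries = ["hello", "i", "don't", "absent"]
results = {q: counts[q] for q in queries}
print(results)
```

Words that never occur simply come back as 0, so no special handling is needed for missing queries.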

answered Jul 02 '12 at 20:05

ninjagecko

+1 for re. `split` solutions won't work with phrases containing punctuation. – georg Jul 02 '12 at 20:35
This is the best answer for me +1 – Nahko Feb 25 '20 at 22:45

score 6 · Answer 3 · answered Jul 02 '12 at 20:05

Counter from collections is your friend:

>>> from collections import Counter
>>> counts = Counter(sentence.lower().split())

answered Jul 02 '12 at 20:05

Martijn Pieters


score 4 · Answer 4 · edited May 23 '17 at 11:54

The vector of occurrence counts of words is called a bag-of-words.

Scikit-learn provides a nice class to compute it, sklearn.feature_extraction.text.CountVectorizer. Example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Inside parentheses no line-continuation backslashes are needed
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=1,  # the default; keeps every term
                             max_features=50)

text = ["Hello I am going to I with hello am"]

# Count; the default settings lowercase the text and drop
# single-character tokens such as "I"
train_data_features = vectorizer.fit_transform(text)
# scikit-learn >= 1.0; older releases used get_feature_names()
vocab = vectorizer.get_feature_names_out()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

Output:

2 am
1 going
2 hello
1 to
1 with

Part of the code was taken from this Kaggle tutorial on bag-of-words.

FYI: How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

score 3 · Answer 5 · edited Jul 02 '12 at 20:43

Here is an alternative, case-insensitive approach:

>>> s = "Hello I am going to I with hello am"
>>> sum(1 for w in s.lower().split() if w == 'Hello'.lower())
2

It matches by converting the string and target into lower-case.

PS: This also takes care of the "am ham".count("am") == 2 problem with str.count() that @DSM pointed out below :)

edited Jul 02 '12 at 20:43

answered Jul 02 '12 at 20:05

Levon

Using count by itself can lead to unexpected results, though: `"am ham".count("am") == 2`. – DSM Jul 02 '12 at 20:07
@DSM .. good point .. I'm not happy with this solution anyway since it's case sensitive, looking at an alternative right now ... – Levon Jul 02 '12 at 20:08

score 2 · Answer 6 · edited Jul 02 '12 at 20:22

Treating Hello and hello as the same word, irrespective of case:

>>> from collections import Counter
>>> strs="Hello I am going to I with hello am"
>>> Counter(map(str.lower,strs.split()))
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

edited Jul 02 '12 at 20:22

answered Jul 02 '12 at 20:14

Ashwini Chaudhary


I would go with `Counter(strs.lower().split())`. Reduces some of the overhead for a faster runtime – inspectorG4dget Jul 02 '12 at 20:15
Isn't this just Martijn Pieters' solution now, though? – DSM Jul 02 '12 at 20:21
@DSM I somehow didn't see his solution, so I updated my solution back to the original version. :) – Ashwini Chaudhary Jul 02 '12 at 20:23

score 1 · Answer 7 · edited Jan 23 '20 at 13:05

You can split the string into words and count them. Note that this gives the total number of words, not the number of occurrences of a particular word:

count = len(my_string.split())

edited Jan 23 '20 at 13:05

answered Jan 23 '20 at 10:02

Booharin


score 0 · Answer 8 · answered Sep 09 '16 at 20:06

You can use the Python regex library re to find all matches of the substring in the string; re.findall returns them as a list, whose length is the count.

import re

input_string = "Hello I am going to I with Hello am"

print(len(re.findall('hello', input_string.lower())))

Prints:

2
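
Note that a bare pattern such as 'hello' also matches inside longer words, which is the same substring issue raised for str.count() elsewhere on this page. One possible refinement, sketched here with a hypothetical helper named count_whole_word, is to anchor the pattern with \b word boundaries and escape the search word. The sketch assumes the search word starts and ends with a letter, digit, or underscore; otherwise \b will not behave as intended.

```python
import re

def count_whole_word(word, text):
    # \b anchors the match at word boundaries; re.escape guards against
    # regex metacharacters in the search word; IGNORECASE ignores case
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

print(count_whole_word("hello", "Hello I am going to I with Hello am"))
print(count_whole_word("am", "am ham"))  # only the standalone "am" matches
print(len(re.findall("am", "am ham")))   # the bare pattern also hits "ham"
```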

answered Sep 09 '16 at 20:06

ode2k


score 0 · Answer 9 · answered Nov 01 '18 at 19:45

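def countSub(pat, string):
    # Count occurrences (overlapping ones included) of pat in string
    result = 0
    for i in range(len(string) - len(pat) + 1):
        # Compare pat against the part of string starting at index i
        for j in range(len(pat)):
            if string[i + j] != pat[j]:
                break
        else:
            # The inner loop did not break, so pat occurs at index i
            result += 1
    return result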
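
As a usage sketch, the function above counts every position where pat occurs, so overlapping occurrences and matches inside longer words are both included. The compact variant below (named count_sub here, an assumption for illustration) is meant to behave the same way and is compared with str.count(), which skips overlapping matches:

```python
def count_sub(pat, string):
    # Count every start index at which pat occurs in string
    return sum(
        1
        for i in range(len(string) - len(pat) + 1)
        if string[i:i + len(pat)] == pat
    )

print(count_sub("aa", "aaaa"))    # overlapping matches are all counted
print("aaaa".count("aa"))         # str.count() counts non-overlapping matches
print(count_sub("am", "am ham"))  # substrings inside other words count too
```

For counting whole words rather than substrings, a tokenizing approach such as split() with Counter is likely the better fit.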

answered Nov 01 '18 at 19:45

Hello, welcome to SO. Your answer contains only code. It would be better if you could also add some commentary to explain what it does and how. Can you please [edit] your answer and add it? Thank you! – Fabio says Reinstate Monica Nov 01 '18 at 21:51
