count words in a string without using split

Question

I have a problem on which I am working where I need to count the number of words in a string without using the split() function in Python. I thought of an approach where I can take a variable word=0 and increment it every time there's an empty space in the string, but it doesn't seems to work as it always gave a count less than the actual count.

s="the sky is blue"

def countW(s):
    print(s)
    word=0
    for i in s:
        if i==" ":
            word=word+1
    print(word)
countW(s)

I know it's a simple question but I am struggling to understand what else I can keep into account to make sure I get the right count. The second approach I was thinking of involves too much for loop and array creation and then back string conversion. Can anyone point me to a simpler approach, where I don't increase the time complexity for this.

def countW(s): print(s) word=0 for i in s: if i==" ": word=word+1 print(word) countW(s) — Faith lost, Feb 16 '17 at 19:22
Possible duplicate of [Count occurrence of a character in a string](http://stackoverflow.com/questions/1155617/count-occurrence-of-a-character-in-a-string) — All Іѕ Vаиітy, Feb 16 '17 at 19:25
counting a string using split is ineffective because it creates a list. It's better to count spaces and add 1. — Jean-François Fabre, Feb 16 '17 at 19:25
Actually i tried this, but it's counting occurrence of each word, i don't want to do that i can use hash and do it, i want count of all words in a string without using any library or functions. I know its simple but i am missing to understand what all factor i need to consider. Ex: "sky is blue" should have O/P of 3 — Faith lost, Feb 16 '17 at 19:26
then your "the sky is blue" is a bad example. There are no repeated words. — Jean-François Fabre, Feb 16 '17 at 19:26
@Jean-FrançoisFabre hi i tried that only but i am not sure if it will fit all edge cases. — Faith lost, Feb 16 '17 at 19:27
If you are concerned about poorly formatted strings, wrong grammar etc. this question is clearly underspecified. In a general sense it is probably not even possible. For example consider: "You are good-looking" vs. "This is good-looking into the sky" How should the computer determine that the first one is a compound adjective while the second one consists of two sentences which are poorly linked? — koalo, Feb 16 '17 at 20:00

score 2 · Answer 1 · answered Feb 17 '17 at 08:23

2

You could also use itertools.groupby, grouping by whether the characters are alpha-numeric or not, and summing all the values (True equaling 1).

>>> s = "the sky is blue"
>>> sum(k for (k, g) in itertools.groupby(s, key=str.isalnum))
4

answered Feb 17 '17 at 08:23

tobias_k

81,265
12
120
179

score 2 · Answer 2 · answered Feb 17 '17 at 08:42

The simplest finite automata with states - inside a word or outside. Pseudocode:

InsideWord = false
Count = 0
for c in s
    if c is not letter
               InsideWord = false 
    else
         if not InsideWord
               Count++
               InsideWord = true

koalo · Accepted Answer · 2017-02-17T08:16:40.877

1

Counting the number of spaces is a good approach and works most of the time. Of course you have to add 1 to get the correct number of words.

However, since you seem to be concerned about poorly formatted strings, you have to consider multiple whitespaces, whitespaces at the beginning and the end as well as punctuation.

If you do not want to use regular expressions (as in Ezsrac's answer), here is an alternative that considers combinations of characters, numbers and the underscore as word, just like \w does. It simply counts all transitions between word characters and non-word characters. The end requires special attention to consider non-word characters at the end (for example "a a " vs. "a a").

def is_word_character(c):
    return 'a' <= c <= 'z' or 'A' <= c <= 'Z' or '0' <= c <= '9' or c == '_'

def word_count(str):
    c = 0
    for i in range(1, len(str)):
        if not is_word_character(str[i]) and is_word_character(str[i-1]):
            c += 1
    if is_word_character(str[-1]):
        c += 1
    return c

Here are some test cases:

>>> word_count("the sky is blue")
4
>>> word_count("the sky is blue.The")
5
>>> word_count(" the sky is   blue ")
4
>>> word_count(" the sky is   blue\nand not green ")
7

If you also want to include other characters you can simply extend the is_word_character function, but be aware that it is not possible to consider all corner cases without using very advanced techniques. For example, consider "You are good-looking" vs. "This is good-looking into the sky". It is not possible for such a simple program to recognize that the first one is a compound adjective while the second one consists of two sentences which are poorly linked.

edited Feb 17 '17 at 08:16

answered Feb 16 '17 at 19:24

koalo

2,113
20
31

I tried that only but i am just wondering if it will fit all edge cases or if their's a better approach then this, avoiding use of Python direct functions like count() or split() – Faith lost Feb 16 '17 at 19:29
@Faith lost Since you are concerned about edge cases, I added another solution – koalo Feb 16 '17 at 19:36
well the code you mentioned will fail in a case where someone has entered a poorly formatted string. Ex: str="the sky is blue.The" – Faith lost Feb 16 '17 at 19:38
You should surely refine your requirements in the question! What is a WORD for you? – koalo Feb 16 '17 at 19:40
Keep in mind that a word can be separated by a slew of interpunction and whitespace characters so checking against space only is not really recommended. I'd create a set like `word_boundaries = set([" ", "\n", ".", ",", "!", ":"])` and then use if `if str[i] in word_boundaries...` to iron out the edge cases. – zwer Feb 16 '17 at 20:11
@zwer as a first step that is a good idea, but it will not handle all cases (see my comment to the question) – koalo Feb 16 '17 at 20:32
It will never handle all cases, you'd need NLP and dictionary processing for that, but it will handle most cases in the English language. Besides, that's precisely what regex does with the `\b` class (or `\w` for a whole word match), this is just a way to do the same without external modules that OP for some reason didn't want. – zwer Feb 16 '17 at 20:38
@zwer The reference to `\w` is a very good hint. It have extended the answer accordingly. – koalo Feb 17 '17 at 08:18
Thanks a lot, everything you have explained above is really helpful. – Faith lost Feb 17 '17 at 20:38

score 0 · Answer 4 · answered Feb 16 '17 at 19:25

0

if you really don't want to use split you could try regex:

import re
s= "the sky is blue"
count = len(re.findall(r'\w+', s))
print (count)

answered Feb 16 '17 at 19:25

Ezsrac

1
3

hi thanks a lot regex one you mentioned is atleast passing the test case which direct space count fails. Ex -- s="the sky is blue.The" gives right answer of 5 – Faith lost Feb 16 '17 at 19:35

score 0 · Answer 5 · edited Apr 18 '22 at 08:59

0

Simply, take the value of word as 1 while initializing:

print("count words")

s = "the sky is dark and lit with stars"

def countW(s):
    print(s)
    word=1
    
    for i in s:
        if i == " ":
            word=word+1
    print(word)

countW(s)

edited Apr 18 '22 at 08:59

Tyler2P

2,324
26
22
31

answered Apr 18 '22 at 07:48

Mayuresh Pachangane

3
1
2

count words in a string without using split

5 Answers5