Python - How to find out how many times the user said the word "the" or "The"

Question

sentence2 = raw_input("Enter the sentence on the StringLab3 WS: ")

sentence.split(sentence2)
for word in default_sentence:
    if word == (chr(84)+chr(104)+chr(101)) or (chr(116)+chr(104)+chr(101)):
        words += 1

print "The amounf of times 'the' or 'The' appear is a total of", words, "times."

This is what I have now, the output is currently 961 for the sentence:

This is a day of national consecration. And I am certain that on this day my fellow Americans expect that on my induction into the Presidency, I will address them with a candor and a decision which the present situation of our people impels. This is preeminently the time to speak the truth, the whole truth, frankly and boldly. Nor need we shrink from honestly facing conditions in our country today. This great Nation will endure, as it has endured, will revive and will prosper. So, first of all, let me assert my firm belief that the only thing we have to fear is fear itself, nameless, unreasoning, unjustified terror which paralyzes needed efforts to convert retreat into advance. In every dark hour of our national life, a leadership of frankness and of vigor has met with that understanding and support of the people themselves which is essential to victory. And I am convinced that you will again give that support to leadership in these critical days.

We're supposed to have the user input this. Any advice?

Why are you using `chr()` instead of just using the literal `"the"`? — Barmar, Jan 07 '14 at 21:59
The root problem here is that `word == 'the' or 'The'` doesn't mean what you think it does. (I've removed the extra obfuscation to make it clearer.) You want `word in ('the', 'The')`. There are about 500 questions on SO that explain why. — abarnert, Jan 07 '14 at 22:01
Yeah, but the `or` problem is a common mistake. I see examples of it on SO every few days. — Barmar, Jan 07 '14 at 22:04
@Matthias: Yeah, I didn't notice that he does `sentence.split(sentence2)` and then does `for word in default_sentence:`. So, if he's lucky those are both `NameError`s; otherwise, they're probably using some old data he left lying around to do god knows what… — abarnert, Jan 07 '14 at 22:05
Somehow his code is counting the number of characters in the input. — Barmar, Jan 07 '14 at 22:08
If `default_sentence` had been defined and `words` had been initialized, it would count the number of characters. — Matthias, Jan 07 '14 at 22:11
@Barmar: If `default_sentence` is a string, it's counting the number of characters in that string. If it's an iterable (e.g., a list of strings), it's counting the number of elements in that iterable; if it's not defined at all, it's raising a `NameError` and counting nothing. — abarnert, Jan 07 '14 at 22:23
The interesting thing is that his sample input has exactly 961 characters, which is what he said his script printed. So I guess he assigned that string to `default_sentence` prior to the code snippet he posted. — Barmar, Jan 07 '14 at 22:25

score 4 · Answer 1 · edited May 23 '17 at 12:19

The simplest implementation, and probably also the fastest, is:

sentence.lower().split().count('the')

Take the paragraph, turn it into lowercase, split it into words, and count how many of those words are 'the'. Almost a direct translation from the problem description.
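As a rough sketch, here is that one-liner wrapped in a function. The name `count_the` and the sample string are made up for illustration, and the snippet is written so that it also runs on Python 3:

```python
def count_the(text):
    """Count whole-word occurrences of 'the', ignoring case."""
    # Lowercase first, split on whitespace, then count exact matches.
    return text.lower().split().count('the')


sample = "The cat sat on the mat near the other cat"
print(count_the(sample))
```

Note that `other` is not counted, because `list.count` compares whole elements rather than substrings.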

The first problem with your attempt is that you read user input into a variable named sentence2, then use it as a separator to split some other variable named sentence, throwing away the result, then loop over yet another variable named default_sentence. That isn't going to work. Python won't guess what you mean just because variable names are kind of similar. You have to write those first three lines like this:
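A minimal sketch of those three lines, consistent with the full listing further down. The hard-coded string is an assumption that stands in for the `raw_input` call so the snippet runs without a prompt:

```python
# Assumed sample text; in the real script this comes from raw_input(...).
sentence = "The quick fox and the lazy dog"
# Split the text that was actually read, and keep the result.
default_sentence = sentence.split()
# Initialize the counter before the loop uses it.
words = 0
```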

The second problem is that your or expression doesn't mean what you think it does. This has been explained in dozens of other questions; you can start at What's going on with my if else statement and, if that doesn't explain it, see the related links and duplicates from there.
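To make the difference concrete, here is a small sketch comparing the original condition with a membership test. The helper names `buggy_match` and `fixed_match` are hypothetical:

```python
def buggy_match(word):
    # Parsed as (word == 'The') or 'the'; the right-hand side is a
    # non-empty string, so the whole expression is always truthy.
    return bool(word == 'The' or 'the')


def fixed_match(word):
    # True only when word is exactly one of the two spellings.
    return word in ('The', 'the')


for w in ('The', 'the', 'banana'):
    print(w, buggy_match(w), fixed_match(w))
```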

If you solve both of those problems, your code actually works:

sentence = raw_input("Enter the sentence on the StringLab3 WS: ")
default_sentence = sentence.split()
words = 0
for word in default_sentence:
    if word in ((chr(84)+chr(104)+chr(101)), (chr(116)+chr(104)+chr(101))):
        words += 1

print "The amounf of times 'the' or 'The' appear is a total of", words, "times."

I don't know why everyone else is over-complicating this in the name of efficiency, by replacing the count with an explicit sum over a comprehension or using regexps or using map to call lower after the split instead of before or… but they're actually making things slower as well as harder to read. Which is usually the case with micro-optimizations like this… For example:

In [2829]: %timeit paragraph.lower().split().count('the')
100000 loops, best of 3: 14.2 µs per loop
In [2830]: %timeit sum([1 for word in paragraph.lower().split() if word == 'the'])
100000 loops, best of 3: 18 µs per loop
In [2831]: %timeit sum(1 for word in paragraph.lower().split() if word == 'the')
100000 loops, best of 3: 17.8 µs per loop
In [2832]: %timeit re.findall(r'\bthe\b', paragraph, re.I)
10000 loops, best of 3: 38.3 µs per loop
In [2834]: %timeit list(map(lambda word: word.lower(), paragraph.split())).count("the")
10000 loops, best of 3: 49.6 µs per loop
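
For anyone who wants to repeat a comparison like this outside IPython, here is a rough sketch using the standard `timeit` module. The sample text and the repetition count are arbitrary assumptions, and the absolute numbers will differ by machine and Python version, so none are shown:

```python
import re
import timeit

# Arbitrary sample text; substitute the real paragraph to compare.
paragraph = "This is preeminently the time to speak the truth, the whole truth. " * 20

candidates = {
    "count": lambda: paragraph.lower().split().count('the'),
    "sum_listcomp": lambda: sum([1 for w in paragraph.lower().split() if w == 'the']),
    "sum_genexp": lambda: sum(1 for w in paragraph.lower().split() if w == 'the'),
    "regex": lambda: len(re.findall(r'\bthe\b', paragraph, re.I)),
}

for name, func in candidates.items():
    # Total seconds for 1000 calls of each candidate.
    seconds = timeit.timeit(func, number=1000)
    print("%-12s %.4f s" % (name, seconds))
```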

wnnmaw · Answer 2 · 2014-01-07T22:09:44.223

2

I'd recommend this:

map(lambda word: word.lower(), paragraph.split()).count("the")

Output:

>>> paragraph = "This is a day of national consecration. And I am certain that on this day my fellow Americans expect that on my induction into the Presidency, I will address them with a candor and a decision which the present situation of our people impels. This is preeminently the time to speak the truth, the whole truth, frankly and boldly. Nor need we shrink from honestly facing conditions in our country today. This great Nation will endure, as it has endured, will revive and will prosper. So, first of all, let me assert my firm belief that the only thing we have to fear is fear itself, nameless, unreasoning, unjustified terror which paralyzes needed efforts to convert retreat into advance. In every dark hour of our national life, a leadership of frankness and of vigor has met with that understanding and support of the people themselves which is essential to victory. And I am convinced that you will again give that support to leadership in these critical days."
>>> map(lambda word: word.lower(), paragraph.split()).count("the")
7

Since my solution may look weird, here's a little explanation from left to right:

map(function, target): This applies the function to all elements of target, so target must be a list or some other iterable. In this case, we're mapping a lambda function, which can be a little scary, so read below about that.

.lower(): Returns a lowercase copy of whatever string it's applied to, word in this case. This is done to ensure that "the", "The", "THE", "ThE", and so on are all counted.

.split(): This splits a string (paragraph) into a list using the separator supplied in the parentheses. When no separator is given (as here), any run of whitespace is treated as a single separator.

.count(item): This counts the instances of item in the list it's applied to. Note that this is not the most efficient way to count things (gotta go regex if you care about speed).
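
The four pieces described above can be tried one at a time. This is only a sketch with a made-up sample string; it wraps `map` in `list(...)` so that it also runs on Python 3, where `map` returns an iterator without a `.count` method:

```python
sample = "The  quick fox\tand the\nlazy dog"

# split() with no argument splits on any run of whitespace.
words = sample.split()
print(words)

# map applies the function to every element; list() materializes the result.
lowered = list(map(lambda word: word.lower(), words))
print(lowered)

# list.count counts elements equal to the argument.
print(lowered.count("the"))
```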

The scary lambda function:

lambda functions are not easy to explain or understand. It's taken me quite a while to get a grip on what they are and when they're useful. I found this tutorial to be rather helpful.

My best attempt at a tl;dr is that lambda functions are small, anonymous functions that can be used for convenience. I know this is, at best, incomplete, but I think it should suffice for the scope of this question.
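
One way to see what the lambda is doing is to put it next to an ordinary named function. In this sketch, `lower_lambda` and `lower_word` are hypothetical names chosen for illustration:

```python
# A lambda is an expression that evaluates to a function object.
lower_lambda = lambda word: word.lower()


# The same behavior written as a regular named function.
def lower_word(word):
    return word.lower()


words = ["The", "THE", "cat"]
print(list(map(lower_lambda, words)))
print(list(map(lower_word, words)))
# The unbound method str.lower can be passed directly as well.
print(list(map(str.lower, words)))
```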

edited Jan 07 '14 at 22:09

answered Jan 07 '14 at 21:58

wnnmaw


in Python3, `map` returns a `map` object which does not contain the `.count` method. Instead do `len([word for word in map(lambda word: word.lower(), paragraph.split()) if word=="the"])` on Python3 – Adam Smith Jan 07 '14 at 22:04
Completely rewriting the OP's code isn't going to help him, or other readers, understand the mistakes he made. – Barmar Jan 07 '14 at 22:05
Why even map the `lower` at all? Just `lower` the whole paragraph before splitting: `paragraph.lower().split().count('the')` is a whole lot simpler. – abarnert Jan 07 '14 at 22:05
Also, your explanation is wrong. You use `list.count`, but describe `"".count`. (And you also refer to using a regex, which cannot count things in a list. Plus, for simple substring matches, regex is usually slower than plain searching, not faster.) – abarnert Jan 07 '14 at 22:07
Another solution: `print(sum(1 for word in sentence.lower().split() if word == 'the'))`. – Matthias Jan 07 '14 at 22:07
No, that edit is _not_ my suggestion. Now you're mapping the identity function (`lambda word: word`) over the list of words, which is just a way of making your code harder to read and slower for no benefit. Again, there is no need for `map` here at all. – abarnert Jan 07 '14 at 22:08
@abarnert How do `list.count` and `"".count` behave differently? And my bad, I just assumed regex is faster since it always shows up in speed questions – wnnmaw Jan 07 '14 at 22:08
@abarnert sorry, I'm reading too fast! – wnnmaw Jan 07 '14 at 22:09
You've changed your answer to refer to `list.count` now. But `str.count` searches for substrings of a string (in the case of literal `"".count(item)`, as you wrote, there are no substrings of the empty string, so it'll return `0`…), while `list.count` searches for elements that are equal. For example, if you just did `paragraph.lower().count('the')`, it would match all of the `'the'` words, and also the `'the'` at the start of `'themselves'`, and so on. – abarnert Jan 07 '14 at 22:10
@abarnert, that makes sense, thanks for pointing out the distinction between the two – wnnmaw Jan 07 '14 at 22:14
Even if you want to keep the unnecessary `map` for some reason, why not `str.lower` instead of `lambda word: word.lower()`? – abarnert Jan 07 '14 at 22:24
@abarnert because I didn't know that worked, I thought a lambda was the only way to do it – wnnmaw Jan 07 '14 at 22:25
@wnnmaw: In Python, almost anything that looks like it _could_ be a first-class value _is_ one. That includes "unbound methods" like `str.lower`, which can be passed around and called the same way as functions or bound methods. To call an unbound method, you pass the `self` as an explicit first argument. And if you look at what `map` does, that's exactly what you want here. – abarnert Jan 07 '14 at 22:27
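
The distinction between `str.count` and `list.count` raised in the comments above can be sketched as follows, using a made-up sample string:

```python
text = "the people themselves gather there"

# str.count counts non-overlapping substring occurrences, so it also
# finds 'the' inside 'themselves', 'gather' and 'there'.
print(text.count("the"))

# list.count counts elements equal to the argument, so only the
# standalone word matches.
print(text.split().count("the"))
```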

Barmar · Answer 3 · 2014-01-07T22:11:06.687

The problem is this line:

if word == (chr(84)+chr(104)+chr(101)) or (chr(116)+chr(104)+chr(101)):

Comparisons in most programming languages cannot be abbreviated the way they can in English. You can't write "equal to A or B" as shorthand for "equal to A or equal to B"; you need to write it out:

if word == (chr(84)+chr(104)+chr(101)) or word == (chr(116)+chr(104)+chr(101)):

What you wrote is parsed as:

if (word == (chr(84)+chr(104)+chr(101))) or (chr(116)+chr(104)+chr(101)):

Since the second operand of the or is always true (it's a string, and all non-empty strings are true), the if always succeeds, so you count all the words, not just "the" and "The".
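
A short sketch of that behavior on a made-up sentence, showing what the `or` expression evaluates to and how it affects the total:

```python
words = "This is the time to speak the truth".split()

# `or` returns its first truthy operand, so a failed comparison
# falls through to the non-empty string on the right.
print(("time" == "The") or "the")

# With the abbreviated condition every word is counted.
buggy_total = sum(1 for w in words if w == "The" or "the")
# With the comparison spelled out twice, only real matches are counted.
fixed_total = sum(1 for w in words if w == "The" or w == "the")
print(buggy_total, fixed_total)
```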

There's also no good reason to use that verbose chr() syntax; just write:

if word == "the" or word == "The":

There are other bugs in your code. The split line should be:

default_sentence = sentence2.split()

score 1 · Answer 4 · answered Jan 07 '14 at 22:08

You can do it like this, using regexes:

#!/usr/bin/env python
import re

input_string = raw_input("Enter your string: ")
# \b anchors the match at word boundaries, so 'then' or 'other' are not counted.
matches = re.findall(r'\b(T|t)he\b', input_string)
print("Total occurrences of the word 'the': %d" % len(matches))

If you want it to be case-insensitive, the call to re.findall can just be changed to re.findall(r'\bthe\b', input_string, re.I).
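
A small sketch of these patterns on a made-up string. One detail worth knowing: because `(T|t)` is a capturing group, `re.findall` returns the captured letters rather than the whole words, which is harmless when only the length is used. The character-class variant is an alternative spelling, not part of the answer above:

```python
import re

text = "The end, the THE, then the."

# With a capturing group, findall returns what the group matched.
print(re.findall(r'\b(T|t)he\b', text))

# A character class avoids the group and returns whole matches.
print(re.findall(r'\b[Tt]he\b', text))

# re.I makes the match case-insensitive, so 'THE' is included too.
print(len(re.findall(r'\bthe\b', text, re.I)))
```

Unlike `split()`, the `\b` anchors also match a word that is followed by punctuation, such as the final `the.` in the sample.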

answered Jan 07 '14 at 22:08

Nick Beeuwsaert


This is exactly how I'd do it, but it IS slower than some other implementations already answered. I find it to be the cleanest, but YMMV. I think the fastest implementation is likely `sum([1 for word in input_string.lower().split() if word=="the"])` – Adam Smith Jan 07 '14 at 22:10
@adsmith: I'm willing to bet `list.count` is faster than calling `sum` on a list comprehension. And of course it's simpler and more readable to boot. – abarnert Jan 07 '14 at 22:11
Yeah, it's slower, but I jump at any chance to abuse regular expressions :P – Nick Beeuwsaert Jan 07 '14 at 22:12
@abarnert `timeit` agrees with you, at least in this use case. `input_string.lower().split().count("the")` finishes in 75% the time of `sum([1 for word in input_string.lower().split() if word=="the"])`. The More You Know. (I actually had never heard of the `list.count` method!) – Adam Smith Jan 07 '14 at 22:18

Chris Barker · Answer 5 · 2014-01-07T22:25:52.323

The reason your code isn't working is because you wrote

if word == (chr(84)+chr(104)+chr(101)) or (chr(116)+chr(104)+chr(101)):
# evaluates to: if word == "The" or "the":
# evaluates to: if False or "the":
# evaluates to: if "the":

Instead of

if (word == (chr(84)+chr(104)+chr(101))) or (word == (chr(116)+chr(104)+chr(101))):
# evaluates to: if (word == "The") or (word == "the")

More importantly, as Barmar pointed out, using the string literal 'the' is much more readable.

So you might want something like this:

count = 0
for word in default_sentence.split():
    if word == 'the' or word == 'The':
        count += 1

wnnmaw has a one-liner that works almost as well. It isn't strictly equivalent, though: lowercasing every word with map(lambda word: word.lower(), ...) also counts 'THE', and by OP's spec we only want to count 'the' and 'The'.
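
The gap between the two readings of the spec can be seen in a short sketch with a made-up word list:

```python
words = "The THE the ThE other".split()

# Case-sensitive: only the two spellings the question asks about.
strict = sum(1 for word in words if word in ('the', 'The'))

# Case-insensitive: every capitalization is counted.
loose = [word.lower() for word in words].count('the')

print(strict, loose)
```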

You need to initialize `count` _outside_ the loop. – Barmar Jan 07 '14 at 22:14
