0

How can I find the words in a string that start with a capital letter?

Example input:

input_str = "The Persian League is the largest sport event dedicated to the deprived areas of Iran. The Persian League promotes peace and friendship. This video was captured by one of our heroes who wishes peace."

Expected output:

Persian League Iran Persian League
bouteillebleu
  • 3
    What about `The` and `This`? Is there a stoplist? – Jarvis Dec 30 '20 at 15:59
  • Welcome to SO! Check out the [tour], and [ask] if you want tips. I'm happy you're getting good answers here, but just want to mention, you'll generally get better answers if you put some effort into finding a solution yourself. To start, there are existing questions about [splitting a string into words](https://stackoverflow.com/q/743806/4518341) and [checking if a word is capitalized](https://stackoverflow.com/q/7353968/4518341). – wjandrea Dec 30 '20 at 16:32
  • Is the issue resolved? If so, please mark the correct answer as accepted to close it. – Jarvis Jan 11 '21 at 15:53

5 Answers

4

Assuming you can accept The and This as well:

import re
input_string = "The Persian League is the largest sport event dedicated to the deprived areas of Iran. The Persian League promotes peace and friendship. This video was captured by one of our heroes who wishes peace."
matches = re.findall(r"([A-Z].+?)\W", input_string)  # raw string so \W is not treated as a string escape

gives

['The', 'Persian', 'League', 'Iran', 'The', 'Persian', 'League', 'This']

If you need to ignore The and This:

matches = re.findall(r"(?!The|This)([A-Z].+?)\W", input_string)  # raw string; the lookahead skips matches starting with The/This

gives

['Persian', 'League', 'Iran', 'Persian', 'League']
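
A possible variation, as a sketch only: the lookahead `(?!The|This)` also rejects any word that merely begins with those letters (for example `Theory`), and `.+?` followed by `\W` needs a non-word character after each match. Matching whole words with `\b` and filtering against a set avoids both. The names `STOPWORDS` and `capitalized_words` are made up for this illustration, and the pattern only covers ASCII capitals.

```python
import re

# Hypothetical stoplist for illustration; adjust to your needs.
STOPWORDS = {"The", "This"}


def capitalized_words(text, stopwords=STOPWORDS):
    """Return whole words that start with A-Z and are not in the stoplist."""
    # \b[A-Z]\w* matches a complete word whose first character is an ASCII capital.
    return [w for w in re.findall(r"\b[A-Z]\w*", text) if w not in stopwords]


input_string = (
    "The Persian League is the largest sport event dedicated to the deprived "
    "areas of Iran. The Persian League promotes peace and friendship. "
    "This video was captured by one of our heroes who wishes peace."
)
print(capitalized_words(input_string))
```

With this approach a word such as `Theory` is kept, because only exact stoplist entries are dropped.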
Jarvis
2

Without regex:

txt = "The Persian League is the largest sport event dedicated to the deprived areas of Iran. The Persian League promotes peace and friendship."

print([w for w in txt.split() if w.istitle()])

Output:

['The', 'Persian', 'League', 'Iran.', 'The', 'Persian', 'League']

If you want to skip the word The (or any other word, for that matter), try this:

print(" ".join(w.replace(".", "") for w in txt.split() if w[0].isupper() and w not in ["The", "This"]))

Output:

Persian League Iran Persian League
baduker
  • 2
    As I mentioned in a comment on another answer, `istitle` may result in unexpected behavior, since it returns `False` if other capital letters exist in the string – Wondercricket Dec 30 '20 at 16:07
  • 1
    Also, using `.split()` will produce `'Iran.'` (with the dot). – namgold Dec 30 '20 at 16:08
  • Regex certainly seems the better choice if you need to ignore stopwords IMO. – Jarvis Dec 30 '20 at 16:14
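
The two points raised in these comments can be checked with a small sketch. The sample words are made up for the demonstration; `string.punctuation` is used here as one possible way to trim the trailing dot, not the only one.

```python
import string

# Made-up sample that includes a mixed-case word and a trailing dot.
words = "The PerSian League of Iran. iPhone".split()

# str.istitle() rejects words with a capital after a cased letter ("PerSian").
by_istitle = [w for w in words if w.istitle()]

# Checking only the first character accepts "PerSian" as well.
by_first_char = [w for w in words if w[0].isupper()]

# Strip surrounding punctuation so "Iran." becomes "Iran".
cleaned = [w.strip(string.punctuation) for w in by_first_char]

print(by_istitle)     # ['The', 'League', 'Iran.']
print(by_first_char)  # ['The', 'PerSian', 'League', 'Iran.']
print(cleaned)        # ['The', 'PerSian', 'League', 'Iran']
```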
1
s = """
The Persian League is the largest sport event dedicated to the deprived areas 
of Iran. The Persian League promotes peace and friendship. This video was 
captured by one of our heroes who wishes peace.
"""
print([x for x in s.split() if x[0].isupper()])
iqmaker
0

Try this:

import re
inputString = "The Persian League is the largest sport event dedicated to the deprived areas of Iran. The Persian League promotes peace and friendship."
splitted = re.split(r' |\.', inputString)  # raw string; split on spaces and dots
result = filter(lambda x: len(x) > 0 and x[0].isupper(), splitted)
print(list(result))

Result:

['The', 'Persian', 'League', 'Iran', 'The', 'Persian', 'League']
namgold
  • `filter(lambda)` is ugly. Use a comprehension instead: `[w for w in b if w[0] >= 'A' and w[0] <= 'Z']` – wjandrea Dec 30 '20 at 15:59
  • 1
    Even less ugly, use the builtin `istitle` method of `str` objects: `[w for w in b if w.istitle()]` – Antimon Dec 30 '20 at 16:00
  • 1
    @Antimon `istitle` may result in unexpected behavior. It will return `False` if there are other capitalized letters in the string (ie `PerSian`) – Wondercricket Dec 30 '20 at 16:01
  • What counts as ugly is just your opinion. IMO, I prefer `filter` to a comprehension. – namgold Dec 30 '20 at 16:02
  • 1
    @Wondercricket Good point. `w[0].isupper()` would work better. I just find the looks of the string comparisons a bit confusing. – Antimon Dec 30 '20 at 16:03
  • 1
    @Antimon `isupper()` method is good. I will update my answer with that method. – namgold Dec 30 '20 at 16:05
  • @namgold Fair enough :) You do you, I'm just [opinionated](https://stackoverflow.com/a/61945553/4518341) about overusing functional constructs. Python is a multi-paradigmatic language for a reason after all. Sorry if I sounded judgy. – wjandrea Dec 30 '20 at 16:08
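
For anyone comparing the two styles discussed in these comments, here is a side-by-side sketch. It assumes the same `re.split` step as the answer; the variable names are illustrative only.

```python
import re

input_string = (
    "The Persian League is the largest sport event dedicated to the deprived "
    "areas of Iran. The Persian League promotes peace and friendship."
)

# Splitting on spaces and dots leaves some empty strings, so both versions
# must guard against them before indexing the first character.
parts = re.split(r" |\.", input_string)

# filter + lambda, as in the answer
with_filter = list(filter(lambda x: len(x) > 0 and x[0].isupper(), parts))

# equivalent list comprehension; an empty string is falsy, so `w and ...` skips it
with_comprehension = [w for w in parts if w and w[0].isupper()]

print(with_filter == with_comprehension)  # True
print(with_comprehension)
```

Both produce the same list, so the choice comes down to readability preference.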
-2

Another way to solve this is to loop over the words with for and append those that start with a capital letter to a list.

phrase = 'The Persian League is the largest sport event dedicated to the deprived areas of Iran. The Persian League promotes peace and friendship. This video was captured by one of our heroes who wishes peace.'

wordsplit = phrase.split()  # split on any whitespace, so no empty strings reach word[0]
capitalLettersWords = []
for word in wordsplit:
    if word[0].isupper():
        capitalLettersWords.append(word)

print(capitalLettersWords)
#['The', 'Persian', 'League', 'Iran.', 'The', 'Persian', 'League', 'This']

In my example I used str.isupper() and str.split(), both built-in string methods in Python.

wjandrea
Danizavtz
  • This is exactly what list comprehensions are meant to avoid. No need to bend over backwards to make your code slower *and* more complicated at the same time. – Antimon Dec 30 '20 at 16:11
  • I would like to write a no-list-comprehension version. – Danizavtz Dec 30 '20 at 16:21