Find the words from the file whose first letter is capitalized

Question

I need to write a script that finds all of the capitalized words (not words in all caps, just the initial letter) in a text file and presents them in alphabetical order.

I tried to use a regex like this:

re.findall(r'\b[A-Z][a-z]*\b', line)

but my function returns this output:

Enter the file name: bzip2.txt
['A', 'All', 'Altered', 'C', 'If', 'Julian', 'July', 'R', 'Redistribution', 'Redistributions', 'Seward', 'The', 'This']

How can I remove all the single-letter words (e.g., A, C, and R)?

You could iterate over the list and remove words that are length 1. Or alternatively, keep words that are longer than 1 character. — Code-Apprentice, Nov 07 '22 at 02:01

Michael M. · Answer 1 · 2022-11-07T02:37:59.907

5

You can do this within the regex itself; there is no need to filter the list afterward. Just use + instead of *:

re.findall(r'\b[A-Z][a-z]+\b', line)

In a regex, * means to match zero or more times, while + means to match one or more times. Your original pattern therefore allowed the lowercase part to match zero times, so a lone capital letter such as A counted as a complete word. With +, at least one lowercase letter must follow the capital. You can learn more about this from this question and its answers.
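To see the difference between the two quantifiers side by side, here is a minimal sketch. The sample sentence is made up for illustration and is not taken from bzip2.txt.

```python
import re

# Hypothetical sample text, not the contents of the asker's file.
sample = "A C compiler by Julian Seward. If R is BSD, This works."

# `*` lets the lowercase part match zero times, so lone capitals are kept.
star = re.findall(r'\b[A-Z][a-z]*\b', sample)

# `+` requires at least one lowercase letter after the capital.
plus = re.findall(r'\b[A-Z][a-z]+\b', sample)

print(star)  # ['A', 'C', 'Julian', 'Seward', 'If', 'R', 'This']
print(plus)  # ['Julian', 'Seward', 'If', 'This']
```

Neither pattern picks up BSD: the trailing \b cannot match between two letters, so all-caps words are skipped in both cases.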

Also, credit where credit is due: blhsing also pointed this out in the comments of the original question while I was writing this answer.

edited Nov 07 '22 at 02:37

answered Nov 07 '22 at 02:05



@StevenRumbalski Thanks, I've updated my answer with a short description. – Michael M. Nov 07 '22 at 02:23
@blhsing Sorry! I didn't realize you also pointed this out in the comments. I've updated my answer with an acknowledgment. – Michael M. Nov 07 '22 at 02:38

ti7 · Answer 2 · 2022-11-07T02:31:26.250

0

Instead of using a regex, split the text into words and check each word directly:

it has at least 2 characters
its first letter is a capital letter

Then call sorted() to get a sorted list:

>>> alphabet = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
>>> sorted(filter(lambda word: len(word) >= 2 and word[0] in alphabet, my_collection))
['All', 'Altered', 'If', 'Julian', 'July', 'Redistribution', 'Redistributions', 'Seward', 'The', 'This']
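The snippet above assumes my_collection already holds the words. A possible end-to-end sketch, assuming the words come from str.split() on some text, might look like this. The sample sentence and the capitalized_words helper are hypothetical and only serve the illustration.

```python
import string

# Hypothetical sample text, not the contents of the asker's file.
text = "A C compiler by Julian Seward. If R is BSD, This works."
alphabet = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

# str.split() breaks on whitespace only, so punctuation stays attached.
my_collection = text.split()
result = sorted(
    word for word in my_collection
    if len(word) >= 2 and word[0] in alphabet
)
print(result)  # ['BSD,', 'If', 'Julian', 'Seward.', 'This']


def capitalized_words(text):
    """Stricter variant (an assumption, not part of the answer):
    strip surrounding punctuation, skip all-caps words, drop duplicates."""
    words = (w.strip(string.punctuation) for w in text.split())
    return sorted({
        w for w in words
        if len(w) >= 2 and w[0] in alphabet and w[1:].islower()
    })


print(capitalized_words(text))  # ['If', 'Julian', 'Seward', 'This']
```

The plain filter keeps attached punctuation and all-caps words such as BSD, which the regex approach would skip. The stricter variant is one way to bring the two closer together, if that matters for your file.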

edited Nov 07 '22 at 02:31

answered Nov 07 '22 at 02:25

