1

I need to write a script that finds all of the capitalized words (not words in all caps, just the initial letter) in a text file and presents them in alphabetical order.

I tried to use a regex like this:

re.findall(r'\b[A-Z][a-z]*\b', line)

but my function returns this output:

Enter the file name: bzip2.txt
['A', 'All', 'Altered', 'C', 'If', 'Julian', 'July', 'R', 'Redistribution', 'Redistributions', 'Seward', 'The', 'This']

How can I remove all the single-letter words (e.g., A, C, and R)?

Michael M.
Shiv Shah

2 Answers

5

You can do this within the regex itself; there is no need to filter the list afterward. Just use + instead of *:

re.findall(r'\b[A-Z][a-z]+\b', line)

In regex, * means to match zero or more times, while + means to match one or more times. Your original pattern therefore allowed the lowercase part to match zero times, so a lone capital letter such as A counted as a word. With +, at least one lowercase letter must follow the capital. You can learn more about this from this question and its answers.
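As a quick check of the difference, here is a minimal sketch that runs both patterns on the same string. The `sample` text is made up for illustration and is not taken from bzip2.txt.

```python
import re

# Hypothetical sample text, chosen only to illustrate the two quantifiers.
sample = "A copy of This file by Julian R Seward"

# '*' lets the lowercase part match zero times, so lone capitals slip through.
with_star = re.findall(r'\b[A-Z][a-z]*\b', sample)

# '+' requires at least one lowercase letter after the capital.
with_plus = re.findall(r'\b[A-Z][a-z]+\b', sample)

print(with_star)  # ['A', 'This', 'Julian', 'R', 'Seward']
print(with_plus)  # ['This', 'Julian', 'Seward']
```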

Also, credit where credit is due: blhsing also pointed this out in the comments of the original question while I was writing this answer.

Michael M.
0

Instead of using a regex, split the text into words and check each word directly:

  • it has at least 2 characters
  • its first letter is an uppercase letter

Then call sorted() to get a sorted list

>>> alphabet = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
>>> sorted(filter(lambda word: len(word) >= 2 and word[0] in alphabet, my_collection))
['All', 'Altered', 'If', 'Julian', 'July', 'Redistribution', 'Redistributions', 'Seward', 'The', 'This']
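Putting the pieces together, one possible sketch of the split-and-filter approach follows. The function name `capitalized_words` and the sample text are assumptions for illustration. Note that the filter above only looks at the first letter, so it would also keep all-caps words; this sketch additionally requires the remaining characters to be lowercase letters and strips surrounding punctuation, which may or may not match what you want.

```python
import string

def capitalized_words(text):
    """Return the sorted, de-duplicated capitalized words found in text."""
    found = set()
    for token in text.split():
        # Strip surrounding punctuation such as commas or parentheses.
        word = token.strip(string.punctuation)
        rest = word[1:]
        # Keep words of 2+ characters: an ASCII capital followed by lowercase letters only.
        if (len(word) >= 2
                and word[0] in string.ascii_uppercase
                and rest.isalpha()
                and rest.islower()):
            found.add(word)
    return sorted(found)

# Hypothetical sample text; in practice, pass the contents of the file instead.
sample = "This file, by Julian R Seward (July), is NOT a BZIP2 copy. This is A test."
print(capitalized_words(sample))  # ['Julian', 'July', 'Seward', 'This']
```

To run it on a file, read the file's contents first and pass the resulting string to the function.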
ti7