I am trying to extract the words that start with a capital letter, regardless of whether the word contains accented or other special characters. Currently, my pattern only matches words made of unaccented letters.

I don't need numbers or hyphens, just accents or special characters in the letters.

pattern = r"\b[A-Z][a-z]*\b"
name = soup.select('h1.data-header__headline-wrapper')[0].text.strip()
name = re.findall(pattern, name)
name = " ".join(name)

Some examples follow. Accented characters should be matched so that players 1 and 4 are returned correctly.

Álvaro Fernández
[]

#3                    
                                            Rico Henry
['Rico', 'Henry']
Rico Henry
#24                    
                                            Tariqe Fosu
['Tariqe', 'Fosu']
Tariqe Fosu
#29                    
                                            Mads Bech Sørensen
['Mads', 'Bech']
Mads Bech
  • Could you provide an example string of what is going in and what you'd like to extract? – Gene Burinsky Aug 08 '22 at 21:32
  • These are names of players from various leagues. For example, the extraction is "#3 Álvaro Fernández" and I would like to get "Álvaro Fernández". @GeneBurinsky –  Aug 09 '22 at 12:35
  • I second Gene Burinsky's comment here. Please add clearer samples of input and expected output to your question. Thank you. – RavinderSingh13 Aug 09 '22 at 13:00
  • Examples added. Player 2 and 3 are correct. @RavinderSingh13 –  Aug 09 '22 at 15:28

1 Answer

You need to run pip install regex in your console and then use the regex module in place of re:

import regex
pattern = r"\b\p{Lu}\p{Ll}*\b"
name = soup.select('h1.data-header__headline-wrapper')[0].text.strip()
name = regex.findall(pattern, name)
name = " ".join(name)

Here,

  • \b - a word boundary
  • \p{Lu} - an uppercase letter
  • \p{Ll}* - zero or more lowercase letters.
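
As a quick sanity check, the Lu and Ll names used above are Unicode general categories, which the standard library exposes through unicodedata.category. The sketch below only illustrates which characters fall into which category; the sample characters are chosen for illustration and are not part of the original answer:

```python
import unicodedata

# Illustrative sample: accented and plain letters, plus a digit and a hyphen.
sample = "ÁáøS3-"
categories = [unicodedata.category(ch) for ch in sample]

for ch, cat in zip(sample, categories):
    # Lu = uppercase letter, Ll = lowercase letter, Nd = decimal digit, Pd = dash punctuation
    print(repr(ch), cat)
```

Accented letters such as Á, á and ø are classified as letters just like their ASCII counterparts, which is why \p{Lu} and \p{Ll} match them while [A-Z] and [a-z] do not.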

If you cannot install the PyPI regex module, you can build character classes of upper- and lowercase letters for the built-in re module instead (based on Python regex for unicode capitalized words):

import sys, re
# Build character classes from every code point Python reports as upper-/lowercase
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]))
pLl = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode + 1) if chr(i).islower()]))
pattern = fr"\b{pLu}{pLl}*\b"
name = soup.select('h1.data-header__headline-wrapper')[0].text.strip()
# Use re here, not regex: this variant needs only the standard library
print(" ".join(re.findall(pattern, name)))

See this Python demo.
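
For reference, here is a minimal self-contained sketch of the re-based approach that runs without BeautifulSoup. The hard-coded strings stand in for the scraped h1 text, and the helper name capitalized_words is a hypothetical name introduced only for this illustration:

```python
import re
import sys

# Character classes built from every code point that str.isupper()/str.islower() accepts.
upper = "".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper())
lower = "".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).islower())
pattern = re.compile(rf"\b[{upper}][{lower}]*\b")


def capitalized_words(text):
    """Return the capitalized words found in text, joined by single spaces."""
    return " ".join(pattern.findall(text))


# Assumed sample inputs mimicking the scraped headline (shirt number, whitespace, name).
print(capitalized_words("#3 \n      Álvaro Fernández"))
print(capitalized_words("#29 \n      Mads Bech Sørensen"))
```

Compiling the pattern once and reusing it avoids rebuilding the large character classes for every player page.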

Wiktor Stribiżew
  • It is not working. It raises an error when it reaches the accented character: "re.error: bad escape \p at position 2" –  Aug 09 '22 at 12:40
  • @Carlos Did you install the PyPi regex module? See "***You need to `pip install regex` in your console***". See [the Python demo online](https://tio.run/##K6gsycjPM/7/PzO3IL@oRKEoNT21gqsgsaQktShPwVahSCkmKaag2qe0FkTm1GrFJClx5SXmpgLllA53JiUrHG7My0tUcCwoyElVcD68MDexKFHBOb8oserw5jyF4NTDG/OL4DrApuulZealJObkaEBt0VEAyWrCTVVQ0svKz8zTgIgWFGXmlUDY//8DAA). – Wiktor Stribiżew Aug 09 '22 at 12:43
  • @Carlos I added a `re` based solution if you cannot afford another dependency in the project. – Wiktor Stribiżew Aug 09 '22 at 12:55
  • Yeah, but it's not working. Same error. It's not including the special characters –  Aug 09 '22 at 15:29
  • @Carlos Show me your code using tio.run. See [my code is working](https://tio.run/##rZExbsMwDEV3nYJwh0hF4KVzhyJrtoxxBtmRYxUyRdAymtTw0LHH6BGKHsEHc@XEcKcCHQoBhPApPpJfdAmVx4dxtDV5DsDmZM6CdAiGER6BkyzPqNu2/RRdf5/liUBdm5hLhve8gOENUcMTkTOwGT5qzRo2nvXr8IWwM8On56XiSk9Li0ftnJy7rGHKqoUKSfrsLcqbSmwxzHdxFwmgidjrohLzyM2lWUdd0LaN5at91x9Waem51kEmM2tfVCytgiiDBYvAGk9GxtK01ucWbeGPRoEt4fYwtU1LZFiqg4ozbN1/kp1/WciL0eXkdBd36GOYfb7uDvLHkt8M/PtPKIhHjOM3). – Wiktor Stribiżew Aug 09 '22 at 15:30
  • The second option worked. Thanks!! –  Aug 09 '22 at 16:52