Regex pattern with accented characters

Question

I am trying to extract the words that start with a capital letter, regardless of whether the word contains accented or other special characters. Currently, my pattern only matches words made of unaccented letters.

I don't need digits or hyphens, only letters, including accented or otherwise special ones.

pattern = r"\b[A-Z][a-z]*\b"
name = soup.select('h1.data-header__headline-wrapper')[0].text.strip()
name = re.findall(pattern, name)
name = " ".join(name)

Some examples follow. Special characters need to be matched so that players 1 and 4 are returned correctly.

Álvaro Fernández
[]

#3                    
                                            Rico Henry
['Rico', 'Henry']
Rico Henry
#24                    
                                            Tariqe Fosu
['Tariqe', 'Fosu']
Tariqe Fosu
#29                    
                                            Mads Bech Sørensen
['Mads', 'Bech']
Mads Bech

Could you provide an example string of what is going in and what you'd like to extract? — Gene Burinsky, Aug 08 '22 at 21:32
These are names of players from various leagues. For example, the extraction is "#3 Álvaro Fernández" and I would like to get "Álvaro Fernández". @GeneBurinsky — , Aug 09 '22 at 12:35
I second Gene Burinsky's comment here. Please add samples of input and expected output to your question to make it clearer. Thank you. – RavinderSingh13, Aug 09 '22 at 13:00
Examples added. Player 2 and 3 are correct. @RavinderSingh13 — , Aug 09 '22 at 15:28

Wiktor Stribiżew · Accepted Answer · 2022-08-09T12:55:29.710


You need to pip install regex in your console and then use

import regex
pattern = r"\b\p{Lu}\p{Ll}*\b"
name = soup.select('h1.data-header__headline-wrapper')[0].text.strip()
name = regex.findall(pattern, name)
name = " ".join(name)

Here,

\b - a word boundary
\p{Lu} - an uppercase letter
\p{Ll}* - zero or more lowercase letters.
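
To illustrate why the ASCII ranges miss these names, here is a small sketch that uses only the standard library. \p{Lu} and \p{Ll} refer to the Unicode general categories Lu and Ll, which unicodedata.category reports for a single character. The sample strings come from the question; the variable names are only for illustration.

```python
import re
import unicodedata

# ASCII-only pattern from the question: [A-Z] and [a-z] exclude accented letters.
ascii_pattern = r"\b[A-Z][a-z]*\b"

plain = re.findall(ascii_pattern, "Rico Henry")
accented = re.findall(ascii_pattern, "Álvaro Fernández")
print(plain)     # ['Rico', 'Henry']
print(accented)  # []

# Unicode general category of a few characters
# (Lu = uppercase letter, Ll = lowercase letter).
categories = {ch: unicodedata.category(ch) for ch in ["A", "Á", "a", "á", "ø", "3", "-"]}
print(categories)
```

"Á" is reported as Lu and "á" and "ø" as Ll, so a pattern built on those categories accepts them, while the digit and the hyphen fall into other categories and stay excluded.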

If you cannot install the PyPI regex module, you need to build character classes of uppercase and lowercase letters yourself (based on Python regex for unicode capitalized words):

import sys, re

# Character classes holding every code point that Python treats as upper-/lowercase
pLu = '[{}]'.format("".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()))
pLl = '[{}]'.format("".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).islower()))
pattern = fr"\b{pLu}{pLl}*\b"
name = soup.select('h1.data-header__headline-wrapper')[0].text.strip()
# Use re.findall here: only re is imported in this variant, not regex
print(" ".join(re.findall(pattern, name)))

See this Python demo.
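
For a version that runs without BeautifulSoup, the following is a minimal sketch of the re-based approach applied to strings shaped like the ones in the question. The helper name extract_name and the hard-coded samples are assumptions made for the example; they are not part of the original code.

```python
import re
import sys

# Every code point that Python reports as uppercase / lowercase.
upper = "".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper())
lower = "".join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).islower())

# One uppercase letter followed by zero or more lowercase letters, as a whole word.
CAPITALIZED_WORD = re.compile(rf"\b[{upper}][{lower}]*\b")

def extract_name(text):
    """Join all capitalized words found in text with single spaces (hypothetical helper)."""
    return " ".join(CAPITALIZED_WORD.findall(text))

samples = [
    "Álvaro Fernández",
    "#3 \n      Rico Henry",
    "#29 \n      Mads Bech Sørensen",
]
for sample in samples:
    print(extract_name(sample))
```

Building the two character classes walks the whole Unicode range, so it is worth doing once at import time and reusing the compiled pattern.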

edited Aug 09 '22 at 12:55

answered Aug 08 '22 at 21:34

Wiktor Stribiżew


It is not working. It raises an error when it reaches the accented character: "re.error: bad escape \p at position 2" – Aug 09 '22 at 12:40
@Carlos Did you install the PyPi regex module? See "***You need to `pip install regex` in your console***". See [the Python demo online](https://tio.run/##K6gsycjPM/7/PzO3IL@oRKEoNT21gqsgsaQktShPwVahSCkmKaag2qe0FkTm1GrFJClx5SXmpgLllA53JiUrHG7My0tUcCwoyElVcD68MDexKFHBOb8oserw5jyF4NTDG/OL4DrApuulZealJObkaEBt0VEAyWrCTVVQ0svKz8zTgIgWFGXmlUDY//8DAA). – Wiktor Stribiżew Aug 09 '22 at 12:43
@Carlos I added a `re` based solution if you cannot afford another dependency in the project. – Wiktor Stribiżew Aug 09 '22 at 12:55
Yeah, but it's not working. Same error. It's not including the special characters. – Aug 09 '22 at 15:29
@Carlos Show me your code using tio.run. See [my code is working](https://tio.run/##rZExbsMwDEV3nYJwh0hF4KVzhyJrtoxxBtmRYxUyRdAymtTw0LHH6BGKHsEHc@XEcKcCHQoBhPApPpJfdAmVx4dxtDV5DsDmZM6CdAiGER6BkyzPqNu2/RRdf5/liUBdm5hLhve8gOENUcMTkTOwGT5qzRo2nvXr8IWwM8On56XiSk9Li0ftnJy7rGHKqoUKSfrsLcqbSmwxzHdxFwmgidjrohLzyM2lWUdd0LaN5at91x9Waem51kEmM2tfVCytgiiDBYvAGk9GxtK01ucWbeGPRoEt4fYwtU1LZFiqg4ozbN1/kp1/WciL0eXkdBd36GOYfb7uDvLHkt8M/PtPKIhHjOM3). – Wiktor Stribiżew Aug 09 '22 at 15:30

The second option worked. Thanks!! – Aug 09 '22 at 16:52
