Find all accented words/characters with Python regex?

Question

I've seen a lot of different posts for handling accented characters, but none that specifically find accented characters in a corpus of text. I'm trying to identify words in the text like nǚ, but the code should not include non-Latin-alphabet results. Ex: 女 should not be selected. The string I'm using for testing is:

"nǚ – woman; girl; daughter; female. A pictogram of a woman with her arms stretched. In old versions she was seated on her knees.  It is a radical that forms part tón of characters related to women and their qualities. 女儿   nǚ'ér – daughter (woman + child) ǚa"

A working regex should select:

nǚ
nǚ'ér
ǚa
tón

Note: There is a similar question here, but the problem is different. This person is just having trouble using regex with accents.

What is this double accent `ǚ` ? I got regex for the other but not for that one — azro, Jun 07 '20 at 20:54
It's an unusual Chinese sound. It has the same pronunciation as the German umlaut, but the little carrot deal is indicating that it's a third tone. I picked it on purpose because it's one of the weirder ones I have to identify. — Grant Curell, Jun 07 '20 at 21:10
seem to be referenciwng unicode. have idea what latin means for unicode, yes ? — , Jun 07 '20 at 21:17
can accent chars be a join of 2 chars ? visually how know from 1 or 2, ? dont, yes ? — , Jun 07 '20 at 21:18

score 1 · Accepted Answer · answered Jun 07 '20 at 20:58

1

To match the accented letter, from this post you can use

[\u00C0-\u017F]
[À-ÖØ-öø-ÿ]

ǚ is not included in but you can extend unicode range to its value : [\u00C0-\u01DA]
' is not an accent you have to add it manually

Giving final \w*[\u00C0-\u01DA']\w* and Code Demo

answered Jun 07 '20 at 20:58

azro

53,056
7
34
70

1

Dude, I have no idea why someone downvoted you - you're a champion. I had some ideas, but that is soooooo much better than what I came up with. You just saved me a huge amount of time. I'd hug you if we were in person. – Grant Curell Jun 07 '20 at 21:09
@GrantCurell - social distancing, remember? :P – MattDMo Jun 08 '20 at 12:13
1

@MattDMo No virus can halt my borderline inappropriate friendliness. – Grant Curell Jun 08 '20 at 17:08

score 0 · Answer 2 · answered Jun 08 '20 at 20:06

A generic solution for Cyrillic, Arabic, etc. would be

[x for x in re.findall(r"\b[^\W\d_]+(?:['’][^\W\d_]+)*\b", s) 
    if re.search(r'[A-Za-z]',x) and re.search(r'(?![a-zA-Z])[^\W\d_]',x)]

re.findall(r"\b[^\W\d_]+(?:['’][^\W\d_]+)*\b" - finds all words that may contain apostrophes
if re.search(r'[A-Za-z]',x) - make sure there is a letter from ASCII range
re.search(r'(?![a-zA-Z])[^\W\d_]',x) - also, make sure there is a letter outside of ASCII range.

Find all accented words/characters with Python regex?

2 Answers2