I've seen a lot of posts about handling accented characters, but none that specifically finds accented characters in a corpus of text. I'm trying to identify words in the text that contain accented Latin letters, but the code should not include non-Latin-alphabet results. Ex: 女 should not be selected. The string I'm using for testing is:

"nǚ – woman; girl; daughter; female. A pictogram of a woman with her arms stretched. In old versions she was seated on her knees.  It is a radical that forms part tón of characters related to women and their qualities. 女儿   nǚ'ér – daughter (woman + child) ǚa"

A working regex should select:

  • nǚ'ér
  • ǚa
  • tón

Note: There is a similar question here, but the problem is different. This person is just having trouble using regex with accents.

Grant Curell
  • What is this double accent `ǚ` ? I got regex for the other but not for that one – azro Jun 07 '20 at 20:54
  • It's an unusual Chinese sound. It has the same pronunciation as the German umlaut, but the little caret deal is indicating that it's a third tone. I picked it on purpose because it's one of the weirder ones I have to identify. – Grant Curell Jun 07 '20 at 21:10
  • You seem to be referring to Unicode. Do you know what "Latin" means in Unicode terms? –  Jun 07 '20 at 21:17
  • Can an accented character be a combination of two code points? Visually you can't tell whether it is one or two, can you? –  Jun 07 '20 at 21:18

2 Answers

To match the accented letter, from this post you can use

  • [\u00C0-\u017F]
  • [À-ÖØ-öø-ÿ]
  • ǚ (U+01DA) is not included in either range, but you can extend the Unicode range up to its code point: [\u00C0-\u01DA]
  • ' is not an accented character, so you have to add it to the character class manually

This gives the final pattern \w*[\u00C0-\u01DA']\w*
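As a rough sketch, the pattern above can be tried against the test string from the question with Python's `re` module. The NFC normalization step is an assumption added here, not part of the answer: it composes a base letter plus combining marks into a single code point, so that a decomposed ǚ still falls inside the range. The names `text`, `ACCENTED_WORD`, and `matches` are illustrative only.

```python
import re
import unicodedata

# Test string from the question.
text = (
    "nǚ – woman; girl; daughter; female. A pictogram of a woman with her "
    "arms stretched. In old versions she was seated on her knees.  It is a "
    "radical that forms part tón of characters related to women and their "
    "qualities. 女儿   nǚ'ér – daughter (woman + child) ǚa"
)

# Assumption: compose "letter + combining accent" sequences into single
# precomposed code points so they fall inside the \u00C0-\u01DA range.
text = unicodedata.normalize("NFC", text)

# Optional word characters, one accented letter or apostrophe, then
# optional word characters again.
ACCENTED_WORD = re.compile(r"\w*[\u00C0-\u01DA']\w*")

matches = ACCENTED_WORD.findall(text)
print(matches)
```

Note that the standalone nǚ at the start of the string is matched as well, and because the apostrophe is part of the character class, an unaccented word such as don't would also be selected.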

azro
  • Dude, I have no idea why someone downvoted you - you're a champion. I had some ideas, but that is soooooo much better than what I came up with. You just saved me a huge amount of time. I'd hug you if we were in person. – Grant Curell Jun 07 '20 at 21:09
  • @GrantCurell - social distancing, remember? :P – MattDMo Jun 08 '20 at 12:13
  • @MattDMo No virus can halt my borderline inappropriate friendliness. – Grant Curell Jun 08 '20 at 17:08

A more generic solution, which also excludes words written entirely in Cyrillic, Arabic, etc., would be:

import re
# s holds the text to search
[x for x in re.findall(r"\b[^\W\d_]+(?:['’][^\W\d_]+)*\b", s)
    if re.search(r'[A-Za-z]',x) and re.search(r'(?![a-zA-Z])[^\W\d_]',x)]
  • re.findall(r"\b[^\W\d_]+(?:['’][^\W\d_]+)*\b", s) - finds all words (runs of letters), which may contain apostrophes
  • if re.search(r'[A-Za-z]',x) - makes sure the word contains a letter from the ASCII range
  • re.search(r'(?![a-zA-Z])[^\W\d_]',x) - also makes sure the word contains a letter outside the ASCII range
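The three steps above could be wrapped into a small function along these lines. This is only a sketch: the function name `mixed_latin_words`, the shortened `sample` string, and the NFC normalization step are assumptions added for illustration (combining marks are not word characters for `re`, so decomposed input would otherwise split words).

```python
import re
import unicodedata

# Words made of letters only, optionally joined by apostrophes.
WORD = re.compile(r"\b[^\W\d_]+(?:['’][^\W\d_]+)*\b")
ASCII_LETTER = re.compile(r"[A-Za-z]")
# A letter that is not an ASCII letter.
NON_ASCII_LETTER = re.compile(r"(?![a-zA-Z])[^\W\d_]")


def mixed_latin_words(s):
    """Return words containing both an ASCII and a non-ASCII letter."""
    # Assumption: compose combining accents into precomposed letters first.
    s = unicodedata.normalize("NFC", s)
    return [
        w
        for w in WORD.findall(s)
        if ASCII_LETTER.search(w) and NON_ASCII_LETTER.search(w)
    ]


# Shortened, hypothetical sample mixing Latin, CJK, and Cyrillic words.
sample = "nǚ'ér – daughter 女儿 tón привет ǚa"
result = mixed_latin_words(sample)
print(result)
```

Two limitations follow from the filter: a word consisting only of accented letters is skipped because it has no ASCII letter, and a token that mixes ASCII letters with, say, CJK characters would still pass.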
Ryszard Czech