I've seen a lot of different posts for handling accented characters, but none that specifically find accented characters in a corpus of text. I'm trying to identify words in the text like nǚ
, but the code should not include non-Latin-alphabet results. Ex: 女 should not be selected. The string I'm using for testing is:
"nǚ – woman; girl; daughter; female. A pictogram of a woman with her arms stretched. In old versions she was seated on her knees. It is a radical that forms part tón of characters related to women and their qualities. 女儿 nǚ'ér – daughter (woman + child) ǚa"
A working regex should select:
- nǚ
- nǚ'ér
- ǚa
- tón
Note: There is a similar question here, but the problem is different. This person is just having trouble using regex with accents.