Regex unicode and accent

Question

Why doesn't `(^|\b)на́($|\b)` match `віч на́ віч`?

`re.sub(r'(^|\b)на́($|\b)', 'на', 'віч на́ віч', flags=re.UNICODE)` returns `'віч на́ віч'`, but I want `'віч на віч'`.
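
A minimal sketch that reproduces this on Python 3. It assumes the accented letter is a plain `а` followed by U+0301 COMBINING ACUTE ACCENT; the name `ACUTE` and the explicit escape are only there so the example does not depend on how the text was copied.

```python
import re
import unicodedata

# Assumption: the accent is a separate combining character after "а".
ACUTE = "\u0301"
text = "віч на" + ACUTE + " віч"
pattern = r"(^|\b)на" + ACUTE + r"($|\b)"

# The substitution from the question: nothing gets replaced.
result = re.sub(pattern, "на", text, flags=re.UNICODE)
print(result == text)  # True

# The accent is a nonspacing mark, and Python's \w does not match it.
print(unicodedata.name(ACUTE), unicodedata.category(ACUTE))  # COMBINING ACUTE ACCENT Mn
print(re.fullmatch(r"\w", ACUTE))  # None
```

Since both the accent and the space after it are non-word characters here, there is no position after `на́` where `\b` can match.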

Please provide a short, complete program that demonstrates the problem. See [mcve] for more information. — Robᵩ, Oct 24 '17 at 18:00
@Tomalak but it _is_ - there is literally no more information that is needed — GrumpyCrouton, Oct 24 '17 at 18:04
@AlexvonBrandenfels Not a good solution. I'm pretty sure things like `віч (на́) віч` should match as well. — Tomalak, Oct 24 '17 at 18:04
@Tomalak it is minimal, complete, and verifiable. I was able to reproduce his problem on regexr.com — Alex von Brandenfels, Oct 24 '17 at 18:04
@AlexvonBrandenfels Since the meaning of `\b` varies by regex flag, a three-liner that shows how the OP is executing the regex is better than a naked regex. — Tomalak, Oct 24 '17 at 18:05
`\b` checks against `a-zA-Z0-9_` (unless `u` modifier is used), but even then can be finicky. See [this post](https://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters) for more information — ctwheels, Oct 24 '17 at 18:06
I believe the issue is the characters: на́ are both in fact word boundaries. See how this works: `import re; text = "## "; print("it will not get here" if re.search(r"##\b", text) else "see?")` — sniperd, Oct 24 '17 at 18:11
@sniperd technically, yes, but that's only because `\b` checks against ASCII characters in *most* flavours of regex. I believe Java does this right (works with Unicode), but I can't say for other languages — ctwheels, Oct 24 '17 at 18:13
@AlexvonBrandenfels you can use this `(?:^|(?<=\W))на́(?=\W|$)` — ctwheels, Oct 24 '17 at 18:16
@ctwheels The question must be reopened or the close reason must be changed. Do you know any good thread where the `(?<!\w)` / `(?!\w)` boundaries are dwelt upon? — Wiktor Stribiżew, Oct 24 '17 at 20:59
@WiktorStribiżew not specifically, the closest I could find was [difference between \w and \b regular expression meta characters](https://stackoverflow.com/questions/11874234/difference-between-w-and-b-regular-expression-meta-characters), which doesn't cover them being used in the way expressed in the comments above. [This](https://stackoverflow.com/a/6880566/3600709), [this](https://stackoverflow.com/a/4215293/3600709), and [this](https://stackoverflow.com/a/4295621/3600709) also suggest it (esp. latter). [symbolhound.com](http://symbolhound.com/advanced.php) to search the web for symbols. — ctwheels, Oct 24 '17 at 21:21
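
The `(?<!\w)` / `(?!\w)` idea from the comments can be sketched as follows on Python 3. This assumes, as a guess about the input, that the accent is U+0301 COMBINING ACUTE ACCENT; `ACUTE`, `flags`, and `fixed` are illustrative names.

```python
import re

# Assumption: the accent is a separate combining character after "а".
ACUTE = "\u0301"
text = "віч на" + ACUTE + " віч"

# Which characters around the end of the word count as \w?
# "а" does, the accent and the space do not, so \b cannot match between them.
flags = [bool(re.fullmatch(r"\w", ch)) for ch in ("а", ACUTE, " ")]
print(flags)  # [True, False, False]

# Lookarounds only require "no word character next to the match",
# so they also work when the match itself ends in a non-word character.
pattern = r"(?<!\w)на" + ACUTE + r"(?!\w)"
fixed = re.sub(pattern, "на", text)
print(fixed)  # віч на віч

# The bracketed case mentioned in the comments is handled as well.
print(re.sub(pattern, "на", "віч (на" + ACUTE + ") віч"))  # віч (на) віч
```

Unlike `\b`, these lookarounds do not need a word character on either side of the position, which is why they are a common stand-in when the token can begin or end with a mark or symbol.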

score 1 · Answer 1 · answered Oct 24 '17 at 18:05

Use \W:

import re
s = "віч на́ віч"
final_s = re.findall(r'\W+', s)[0]

Output:

"віч на́ віч"
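
For comparison, a sketch of what `\W+` picks out of this string on Python 3 with a `str` pattern, assuming the accent is U+0301 COMBINING ACUTE ACCENT. Under that assumption the first run is a single space rather than the whole string, so the output quoted above presumably comes from a different setup; that part is a guess.

```python
import re

# Assumption: the accent is a separate combining character after "а".
ACUTE = "\u0301"
s = "віч на" + ACUTE + " віч"

# Every maximal run of non-word characters in the string.
runs = re.findall(r"\W+", s)

# Print code points so the invisible combining mark is easy to see.
print([[f"U+{ord(c):04X}" for c in run] for run in runs])
# [['U+0020'], ['U+0301', 'U+0020']]
```

The second run shows the accent grouped with the following space, which is the same reason `\b` finds no boundary after `на́`.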

Ajax1234
