
Why doesn't (^|\b)на́($|\b) match віч на́ віч?

re.sub(r'(^|\b)на́($|\b)', 'на', 'віч на́ віч', flags=re.UNICODE) returns 'віч на́ віч' unchanged, but I want 'віч на віч'.
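
A minimal sketch that reproduces this, assuming Python 3 and assuming the accented "а" is stored as "а" followed by U+0301 (combining acute accent). The variable names are only illustrative. It also prints, for each character of the accented word, whether Python's `\w` accepts it, since `\b` is defined in terms of `\w`:

```python
import re
import unicodedata

# Assumption: the accent is the combining character U+0301 after "а".
WORD = "на\u0301"
text = "віч " + WORD + " віч"
pattern = r"(^|\b)" + WORD + r"($|\b)"

result = re.sub(pattern, "на", text, flags=re.UNICODE)
print(result == text)  # True: nothing was replaced

# \b needs a word character on exactly one side of the position.
# Check which characters of the accented word match \w.
for ch in WORD:
    is_word = re.fullmatch(r"\w", ch) is not None
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} word={is_word}")
```

The combining accent is reported as a non-word character, so the position between it and the following space has non-word characters on both sides and `\b` cannot match there.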

Paul R
  • Please provide a short, complete program that demonstrates the problem. See [mcve] for more information. – Robᵩ Oct 24 '17 at 18:00
  • Try \s instead of \b – Alex von Brandenfels Oct 24 '17 at 18:03
  • @GrumpyCrouton A [mcve]. – Tomalak Oct 24 '17 at 18:03
  • @Tomalak but it _is_ - there is literally no more information that is needed – GrumpyCrouton Oct 24 '17 at 18:04
  • @AlexvonBrandenfels Not a good solution. I'm pretty sure things like `віч (на́) віч` should match as well. – Tomalak Oct 24 '17 at 18:04
  • @Tomalak it is minimal, complete, and verifiable. I was able to reproduce his problem on regexr.com – Alex von Brandenfels Oct 24 '17 at 18:04
  • @AlexvonBrandenfels Since the meaning of `\b` varies by regex flag, a three-liner that shows how the OP is executing the regex is better than a naked regex. – Tomalak Oct 24 '17 at 18:05
  • `\b` checks against `a-zA-Z0-9_` (unless `u` modifier is used), but even then can be finicky. See [this post](https://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters) for more information – ctwheels Oct 24 '17 at 18:06
  • I believe the issue is the characters: на́ are both in fact word boundaries. See how this works:
        import re
        text = "## "
        if re.search(r"##\b", text):
            print("it will not get here")
        else:
            print("see?")
    – sniperd Oct 24 '17 at 18:11
  • @sniperd technically, yes, but that's only because `\b` checks against ASCII characters in *most* flavours of regex. I believe Java does this right (works with Unicode), but I can't say for other languages – ctwheels Oct 24 '17 at 18:13
  • @AlexvonBrandenfels you can use this `(?:^|(?<=\W))на́(?=\W|$)` – ctwheels Oct 24 '17 at 18:16
  • @ctwheels Same regex is `(?<!\w)на́(?!\w)` – Wiktor Stribiżew Oct 24 '17 at 18:20
  • @WiktorStribiżew yep you're right that's much better too. – ctwheels Oct 24 '17 at 18:27
  • @ctwheels The question must be reopened or the close reason must be changed. Do you know any good thread where the `(?<!\w)` / `(?!\w)` boundaries are dwelt upon? – Wiktor Stribiżew Oct 24 '17 at 20:59
  • @WiktorStribiżew not specifically, the closest I could find was [difference between \w and \b regular expression meta characters](https://stackoverflow.com/questions/11874234/difference-between-w-and-b-regular-expression-meta-characters), which doesn't cover them being used in the way expressed in the comments above. [This](https://stackoverflow.com/a/6880566/3600709), [this](https://stackoverflow.com/a/4215293/3600709), and [this](https://stackoverflow.com/a/4295621/3600709) also suggest it (esp. latter). [symbolhound.com](http://symbolhound.com/advanced.php) to search the web for symbols. – ctwheels Oct 24 '17 at 21:21
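
The lookaround boundaries suggested in the comments, `(?<!\w)` and `(?!\w)`, can be sketched in Python as follows. This is a minimal sketch rather than a definitive fix: it assumes Python 3 and that the accent is the combining character U+0301, and the helper name `strip_accent` is hypothetical.

```python
import re

# Assumption: the accented word is "на" followed by U+0301.
ACCENTED = "на\u0301"

# "Not preceded by a word character" and "not followed by a word character".
# Unlike \b, these do not need a word character next to the boundary.
PATTERN = re.compile(r"(?<!\w)" + ACCENTED + r"(?!\w)")


def strip_accent(text):
    """Replace the standalone accented word with its unaccented form."""
    return PATTERN.sub("на", text)


print(strip_accent("віч " + ACCENTED + " віч"))
print(strip_accent("віч (" + ACCENTED + ") віч"))
print(strip_accent("з" + ACCENTED + "м"))  # part of a longer word: unchanged
```

The same pattern also covers the parenthesized case raised in the comments, and it leaves the sequence alone when letters are attached on either side.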

1 Answer

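Use \W in lookarounds instead of \b. The combining accent in на́ is not a word character, so there is no \b between it and the following space:

import re
s = "віч на́ віч"
# Require a non-word character (or the string edge) on each side of на́
final_s = re.sub(r'(?:^|(?<=\W))на́(?=\W|$)', 'на', s)
print(final_s)

Output:

віч на віч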
Ajax1234