Golang regexp Boundary with Latin Character

Question

I have a small but tricky issue with Go regular expressions. The \b word boundary doesn't seem to work when the input contains accented Latin characters, as in the example below.

I expected é to be treated as a regular word character, but it is treated as a word boundary.

package main

import (
    "fmt"
    "regexp"
)

func main() {   
    r, _ := regexp.Compile(`\b(vis)\b`)
    fmt.Println(r.MatchString("re vis e"))
    fmt.Println(r.MatchString("revise"))
    fmt.Println(r.MatchString("révisé"))
}

The result was:

true 
false 
true

How can I make r.MatchString("révisé") return false?

Thank you

Yes, `\b` word boundaries only support ASCII--the docs say `at ASCII word boundary (\w on one side and \W, \A, or \z on the other)`. The only option is probably to explicitly match for characters you consider a word boundary in the regex (spaces, newlines, full stops, end of string, etc) — Herman Schaaf, Feb 04 '16 at 05:02

Herman Schaaf · Accepted Answer · 2016-02-04T05:14:41.567

The issue is that \b is only for boundaries around ASCII characters, as stated in the docs:

at ASCII word boundary (\w on one side and \W, \A, or \z on the other)

Since é is not ASCII, it counts as a non-word character (\W), so \b matches between é and v. However, you can build your own \b replacement by combining other regex constructs. Here is a simple solution that handles the cases given in the question, though you may want to add more thorough matching:

package main

import (
    "fmt"
    "regexp"
)

func main() {   
    r, _ := regexp.Compile(`(?:\A|\s)(vis)(?:\s|\z)`)
    fmt.Println(r.MatchString("vis")) // added this case
    fmt.Println(r.MatchString("re vis e"))
    fmt.Println(r.MatchString("revise"))
    fmt.Println(r.MatchString("révisé"))
}

Running this gives:

true
true
false
false

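What this solution does is essentially replace \b with (?:\A|\z|\s), which means "a non-capturing group with one of the following: start of string, end of string or whitespace". Unlike \b, the \s alternative consumes a character, so any surrounding whitespace becomes part of the overall match; use the capture group if you need just the word. You may want to add other possibilities here, like punctuation.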
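One way to cover punctuation and other accented letters at once is to invert the idea: instead of listing the characters that separate words, treat anything that is not a Unicode letter, combining mark, digit, or underscore as a boundary. The following is a minimal sketch under that assumption, not the only reasonable definition of a word character. The names wordVis and containsWord are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordVis (hypothetical name) matches "vis" only when it is not adjacent to
// a Unicode letter (\p{L}), combining mark (\p{M}), digit (\p{N}), or
// underscore. Without the m flag, ^ and $ match only at the start and end
// of the text.
var wordVis = regexp.MustCompile(`(?:^|[^\p{L}\p{M}\p{N}_])(vis)(?:[^\p{L}\p{M}\p{N}_]|$)`)

// containsWord reports whether s contains "vis" as a standalone word.
func containsWord(s string) bool {
	return wordVis.MatchString(s)
}

func main() {
	for _, s := range []string{"vis", "re vis e", "revise", "révisé", "(vis)."} {
		fmt.Printf("%q: %v\n", s, containsWord(s))
	}
}
```

With this pattern, "vis", "re vis e" and "(vis)." report true, while "revise" and "révisé" report false. Including \p{M} is a design choice: it keeps a decomposed é (e followed by a combining accent) from being read as a boundary.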

This works great. Including punctuation at the end looks like: `(?:[[:punct:]]|\s|\z)` — Matt Baer, Jan 11 '18 at 17:11
