Postgresql Regex for Word Boundary not Working

Question

Using v9.6.2 (didn't try other versions).

First one works. Second one with a Kanji word fails. What gives?

dev=> select 'foo bar' ~ '\ybar\y' v;
 v
---
 t
(1 row)

dev=> select '積極的 積極的' ~ '\y積極的\y' v;
 v
---
 f
(1 row)

http://stackoverflow.com/questions/280712/javascript-unicode-regexes (Javascript, but I'm pretty sure that the issue is the same.) — Dmitri Goldring, May 07 '17 at 15:56
[From the docs](https://www.postgresql.org/docs/current/static/functions-matching.html): *A word is defined as a sequence of word characters that is neither preceded nor followed by word characters. A word character is an alnum character (**as defined by ctype**) or an underscore.* -- [try with](http://rextester.com/QMJE27595) `COLLATE "zh_CN"` — pozs, May 08 '17 at 12:52

Leo C · Answer 1 · 2017-05-08T01:13:47.253

1

It'll work without enclosing by \y:

SELECT '積極的 積極的' ~ '積極的' AS v;
 v 
---
 t
(1 row)

regexp_matches will work too:

SELECT regexp_matches('積極的 積極的', '^.*(積極的).*$') AS v;
    v     
----------
 {積極的}
(1 row)

[UPDATE]

Contemporary Chinese characters are rendered using Unicode, which not all programming platforms fully support when it comes to regex word boundaries. I suppose PostgreSQL isn't using a regex engine that supports Unicode word boundaries.

Some programming languages like Scala (Java as well) do support Unicode with word boundaries:

scala> """\b積極的\b""".r findFirstIn "積極的 積極的"
res4: Option[String] = Some(積極的)

Note that \b, not \y, is used for word boundaries in Scala/Java.

edited May 08 '17 at 01:13

answered May 07 '17 at 16:03

Leo C

22,006
3
26
39

Hmmm, my bad for not noticing word boundary was the focus. Updated my answer. – Leo C May 08 '17 at 00:49
Thanks, yeah, probably that Postgres doesn't support it. Appreciate your effort. – user2297550 May 08 '17 at 04:20

Postgresql Regex for Word Boundary not Working

1 Answers1