0

Using v9.6.2 (didn't try other versions).

First one works. Second one with a Kanji word fails. What gives?

dev=> select 'foo bar' ~ '\ybar\y' v;
 v
---
 t
(1 row)

dev=> select '積極的 積極的' ~ '\y積極的\y' v;
 v
---
 f
(1 row)
user2297550
  • 3,142
  • 3
  • 28
  • 39
  • 1
    http://stackoverflow.com/questions/280712/javascript-unicode-regexes (Javascript, but I'm pretty sure that the issue is the same.) – Dmitri Goldring May 07 '17 at 15:56
  • 1
    [From the docs](https://www.postgresql.org/docs/current/static/functions-matching.html): *A word is defined as a sequence of word characters that is neither preceded nor followed by word characters. A word character is an alnum character (**as defined by ctype**) or an underscore.* -- [try with](http://rextester.com/QMJE27595) `COLLATE "zh_CN"` – pozs May 08 '17 at 12:52

1 Answers1

1

It'll work without enclosing by \y:

SELECT '積極的 積極的' ~ '積極的' AS v;
 v 
---
 t
(1 row)

regexp_matches will work too:

SELECT regexp_matches('積極的 積極的', '^.*(積極的).*$') AS v;
    v     
----------
 {積極的}
(1 row)

[UPDATE]

Contemporary Chinese characters are rendered using Unicode, which not all programming platforms fully support when it comes to regex word boundaries. I suppose PostgreSQL isn't using a regex engine that supports Unicode word boundaries.

Some programming languages like Scala (Java as well) do support Unicode with word boundaries:

scala> """\b積極的\b""".r findFirstIn "積極的 積極的"
res4: Option[String] = Some(積極的)

Note that \b, not \y, is used for word boundaries in Scala/Java.

Leo C
  • 22,006
  • 3
  • 26
  • 39