Compare full-width and half-width Japanese characters in MongoDB by using collation and regex

Question

According to the MongoDB documentation and the ICU documentation, it should be possible to ignore the difference between full-width and half-width characters in Japanese text by using collation.

I tried the following collation:

{ locale: "ja", caseLevel:true, strength:1}

with different strength values, but none of them work.

db.getCollection('mycollection')
        .find({"desc":/ﾊﾞﾝﾄﾞ/})
        .collation({ locale: "ja", caseLevel:true, strength:1})

This query does not return the following document:

{
    "desc": "＊EGRパイプバンド外れ"
}

Update

I found the reason: in MongoDB, collation is not applied to regex matches. If I use an exact match instead, the query works as expected:

db.getCollection('mycollection')
        .find({"desc":"*EGRﾊﾟｲﾌﾟﾊﾞﾝﾄﾞ外れ???"})
        .collation({ locale: "ja", caseLevel:true, strength:1})

This query returns the document containing ＊EGRパイプバンド外れ.

But it does not work when I use a regex. Any suggestions?

score 1 · Accepted Answer · answered Oct 15 '19 at 03:53

There is no way to make collation work with a regex query: the regex ignores any collation definition and uses only its own matching logic, so it finds only strings that contain the half-width ﾊﾞﾝﾄﾞ.

The simplest workaround is to add a step before you send the search text to MongoDB that produces both the half-width and the full-width form of the text. You can use an existing conversion tool for this.

Then use both the half-width and the full-width search text in your find condition with $or:

db.mycollection.find({$or: [{"desc":/ﾊﾞﾝﾄﾞ/}, {"desc":/バンド/}]})
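
The conversion step described above could be sketched roughly as follows. This is only an illustration, not the tool the answer refers to: the helper names (`toFullwidthKatakana`, `toHalfwidthKatakana`, `escapeRegex`, `buildWidthFilter`) are hypothetical, and it assumes a JavaScript runtime with `String.prototype.normalize`. It relies on Unicode NFKC normalization mapping half-width katakana to full-width.

```javascript
// Hypothetical helpers, not part of MongoDB or of the original answer.

// Convert runs of half-width katakana (U+FF61..U+FF9F) to full-width.
// NFKC is applied only to those runs, so other characters (for example
// the full-width asterisk) are left untouched.
function toFullwidthKatakana(text) {
  return text.replace(/[\uFF61-\uFF9F]+/g, (run) => run.normalize("NFKC"));
}

// Reverse table derived from NFKC: full-width code point -> half-width.
// The range also covers a few half-width punctuation marks.
const FULL_TO_HALF = new Map();
for (let cp = 0xff61; cp <= 0xff9f; cp++) {
  const half = String.fromCodePoint(cp);
  FULL_TO_HALF.set(half.normalize("NFKC"), half);
}

// Convert full-width katakana to half-width. Voiced kana such as バ are
// decomposed with NFD (base kana + combining mark) and mapped piece by piece.
// Characters without a half-width form (kanji, hiragana, ASCII) are kept.
function toHalfwidthKatakana(text) {
  let out = "";
  for (const ch of text) {
    const parts = [...ch.normalize("NFD")];
    out += parts.every((p) => FULL_TO_HALF.has(p))
      ? parts.map((p) => FULL_TO_HALF.get(p)).join("")
      : ch;
  }
  return out;
}

// Escape regex metacharacters so the search text is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build the $or filter with one regex per width variant.
function buildWidthFilter(field, text) {
  const variants = new Set([toFullwidthKatakana(text), toHalfwidthKatakana(text)]);
  return {
    $or: [...variants].map((v) => ({ [field]: { $regex: escapeRegex(v) } })),
  };
}

const filter = buildWidthFilter("desc", "ﾊﾞﾝﾄﾞ");
// In the shell: db.mycollection.find(filter)
```

As in the query above, this covers only search text that is entirely half-width or entirely full-width; each variant is matched as a whole.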

Same issue:

Use of collation in mongodb $regex

buræquete


Yes, I have thought about it. It's my last resort because it cannot handle mixed input such as `ﾊﾞンド`. – Jarvan Oct 17 '19 at 03:13
I have seen some chapters about `normalization` in the ICU documentation, and the ICU demo at http://demo.icu-project.org/icu-bin/nbrowser suggests it is possible to use normalization to standardize characters. But I cannot get it to work in collation. Is it possible to use it? – Jarvan Oct 17 '19 at 03:19
@Jarvan yeah I've thought about that mixed case, but it is a terrible example, who would send such a mixed text? I've never seen that. But sadly there is no pure mongo solution afaik. You need something before the mongo call... I don't think there is any way to do modification to characters that will make them match both half & full width, especially within regex. – buræquete Oct 17 '19 at 03:24
@Jarvan if you'd like to utilize my solution & also cover the mixed case, you can generate an array of strings, essentially a combination of half & full width chars (like `バﾝﾄﾞ`, `ﾊﾞンﾄﾞ`, `ﾊﾞﾝド`, etc.), and `$or` them all, but that will be really slow in mongo I think. – buræquete Oct 17 '19 at 03:26
You are right, it's a bad example. I guess this is the best we can do for now; anything more would be over-engineering for such an unusual case. Thanks for your help! – Jarvan Oct 17 '19 at 04:24
@Jarvan sorry, I wish I'd have given a full answer, but it is such a weird edge case really :( Please do check my other answers, there are some interesting ones regarding Japanese text :) – buræquete Oct 17 '19 at 06:09
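
For the mixed case discussed in the comments, one alternative to `$or`-ing every combination is a single regex in which each character is an alternation of its full-width and half-width forms. The sketch below is not what the answer proposes; it swaps in per-character alternation. `widthInsensitivePattern` is a hypothetical helper, and it assumes the regex engine used by MongoDB accepts non-capturing groups and UTF-8 literals.

```javascript
// Hypothetical helper, not a MongoDB API.
// Returns a regex source string such as (?:バ|ﾊﾞ)(?:ン|ﾝ)(?:ド|ﾄﾞ).
function widthInsensitivePattern(text) {
  // Full-width code point -> half-width, derived from NFKC normalization.
  const halfByFull = new Map();
  for (let cp = 0xff61; cp <= 0xff9f; cp++) {
    const half = String.fromCodePoint(cp);
    halfByFull.set(half.normalize("NFKC"), half);
  }

  // Escape regex metacharacters so the search text is matched literally.
  const escape = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

  // Step 1: bring the input to full-width katakana, whatever mix it came in.
  const full = text.replace(/[\uFF61-\uFF9F]+/g, (run) => run.normalize("NFKC"));

  // Step 2: emit each character as-is, or as an alternation of both widths
  // when a half-width form exists.
  let pattern = "";
  for (const ch of full) {
    const parts = [...ch.normalize("NFD")];
    if (parts.every((p) => halfByFull.has(p))) {
      const half = parts.map((p) => halfByFull.get(p)).join("");
      pattern += `(?:${escape(ch)}|${escape(half)})`;
    } else {
      pattern += escape(ch);
    }
  }
  return pattern;
}

const pattern = widthInsensitivePattern("ﾊﾞンド");
// In the shell: db.mycollection.find({ desc: { $regex: pattern } })
```

The pattern grows linearly with the length of the search text instead of exponentially with the number of combinations, but like any unanchored regex it is unlikely to use an index efficiently.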
