
In my analyzer I have added the asciifolding filter. In most cases this works very well, but for the Danish language I would like to not normalize the ø, æ and å characters, since "rød" and "rod" are very different words.

We are using the hosted Elastic Cloud cluster, so if possible I would like a solution that does not require any non-standard deployments through the cloud platform.

Is there any way to do asciifolding, but whitelist certain characters?

Currently running on ES version 6.8

mortenbock

2 Answers


You should probably be using the ICU Folding Token Filter.

From the documentation:

Case folding of Unicode characters based on UTR#30, like the ASCII-folding token filter on steroids.

It lets you do everything that the asciifolding filter does, but in addition it allows you to exclude a set of characters through the unicodeSetFilter property.

In this case, you want to ignore æ, ø, å, Æ, Ø, Å:

"unicodeSetFilter": "[^æøåÆØÅ]"

Complete example:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "danish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "danish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "danish_folding": {
            "type": "icu_folding",
            "unicodeSetFilter": "[^æøåÆØÅ]"
          }
        }
      }
    }
  }
}
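
To verify the behaviour, you can run the analyzer against a sample string once the index above exists; the Danish characters should survive while other diacritics are still folded:

GET icu_sample/_analyze
{
  "analyzer": "danish_analyzer",
  "text": "rød rod café"
}

This should return the tokens rød, rod and cafe.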
Silas Hansen
  • This looks very promising, and the analysis-icu plugin is available as standard on the cloud platform. I will try this out. – mortenbock Feb 21 '20 at 14:48
  • I'm positive it will solve it for you. I can recommend the "scandinavian_normalization" filter as well (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-normalization-tokenfilter.html), if you care about users from other Scandinavian countries being able to find your content using their own versions of ö/ø, ä/æ, etc., without destroying the meaning of the special characters. – Silas Hansen Apr 03 '20 at 06:56
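
A minimal sketch of the scandinavian_normalization suggestion from the comment above (the index and analyzer names are illustrative):

PUT scandi_sample
{
  "settings": {
    "analysis": {
      "analyzer": {
        "scandi_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "scandinavian_normalization"
          ]
        }
      }
    }
  }
}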

You are already using the ASCII folding token filter, but because it is a token filter it cannot selectively skip certain characters. The analysis process consists of the three sequential steps below:

  1. char filter (here you can filter or replace certain characters)
  2. tokenizer (generates the tokens)
  3. token filter (can modify the tokens generated by the tokenizer)
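
For illustration, here is a custom analyzer wiring all three stages together; the char filter mapping and the index and analyzer names are made up for the example, not a solution to the folding problem:

PUT analysis_sample
{
  "settings": {
    "analysis": {
      "char_filter": {
        "amp_to_and": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "analyzer": {
        "three_stage_analyzer": {
          "char_filter": ["amp_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}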

There is no out-of-the-box solution that efficiently addresses your issue (skipping normalization for only a few characters).

Elasticsearch: The Definitive Guide discusses this problem.

You can use the preserve_original parameter on the token filter, which keeps the original token at the same position as the folded one, but mixing folded and original tokens in one field skews relevance and makes it hard to rank exact matches on the original word higher.
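
A minimal sketch of that option (the index and analyzer names are illustrative):

PUT folding_sample
{
  "settings": {
    "analysis": {
      "filter": {
        "folding_preserve": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "folding_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "folding_preserve"]
        }
      }
    }
  }
}

With this analyzer, "rød" is indexed as both rød and rod at the same position.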

Hence the same book advises indexing the original text in a separate field (for example a multi-field) and then querying with a multi_match query of type most_fields.
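
A sketch of that multi-field approach on ES 6.8; the index, field and analyzer names are illustrative, and the built-in danish analyzer is used for the original field:

PUT multi_sample
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "danish",
          "fields": {
            "folded": {
              "type": "text",
              "analyzer": "folding_analyzer"
            }
          }
        }
      }
    }
  }
}

GET multi_sample/_search
{
  "query": {
    "multi_match": {
      "query": "rød",
      "type": "most_fields",
      "fields": ["title", "title.folded"]
    }
  }
}

Documents matching on the original title field (with the unfolded character) will score higher than those matching only on the folded sub-field.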

Amit