3

I have a token filter and analyzer as follows, but I can't get the original token to be preserved. For example, when I run _analyze on the word saint-louis, I get back only saintlouis, whereas I expected both saintlouis and saint-louis, since I have preserve_original set to true. I am using Elasticsearch 6.3.2 and Lucene 7.3.1.

"analysis": {
  "filter": {
    "hyphenFilter": {
      "pattern": "-",
      "type": "pattern_replace",
      "preserve_original": "true",
      "replacement": ""
    }
  },
  "analyzer": {
    "whitespace_lowercase": {
      "filter": [
        "lowercase",
        "asciifolding",
        "hyphenFilter"
      ],
      "type": "custom",
      "tokenizer": "whitespace"
    }
  }
}
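The behavior I'm seeing can be emulated outside Elasticsearch: a pattern_replace token filter rewrites each token in place, so the hyphenated original is never emitted alongside the replaced form. A rough Python sketch of that behavior (the function and tokenizer here are my own illustration, not the actual Lucene code):

```python
import re

def pattern_replace_filter(tokens, pattern="-", replacement=""):
    # Emulates ES pattern_replace: each token is rewritten in place;
    # there is no second output stream carrying the original token.
    return [re.sub(pattern, replacement, t) for t in tokens]

# whitespace tokenizer + lowercase filter, then the hyphen filter
tokens = "saint-louis".lower().split()
print(pattern_replace_filter(tokens))  # only ['saintlouis'] comes back
```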
Ram K
  • 1,746
  • 2
  • 14
  • 23
  • 1
    Were you able to solve the issue? – Amit Mar 02 '20 at 16:00
  • @OpsterElasticsearchNinja I switched to using a word delimiter token filter. I think the pattern_replace filter does not support a preserve_original flag, at least not in the version I am using. – Ram K Mar 02 '20 at 18:34
  • 1
    Would you like to post the answer, it would help other community members – Amit Mar 02 '20 at 19:04
  • @OpsterElasticsearchNinja I will, thanks. – Ram K Mar 02 '20 at 19:06
  • 1
    @OpsterElasticsearchNinja I posted my answer. Feel free to modify any part if necessary. – Ram K Mar 02 '20 at 19:24
  • Thanks, but I would advise you to provide the entire setting and mapping in JSON format so that everybody can test and use it. You can refer my https://stackoverflow.com/questions/60487022/search-in-elasticsearch-errors-when-applying-analyzer-filter/60487446#60487446 and https://stackoverflow.com/questions/60479170/elasticsearch-analyzer-to-remove-quoted-sentences/60483185#60483185 on how to provide them. – Amit Mar 02 '20 at 19:33

1 Answer

3

So it looks like preserve_original is not supported on the pattern_replace token filter, at least not in the version I am using.

I made a workaround as follows:

Index definition

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "type": "custom",
                    "filter": [
                        "lowercase",
                        "hyphen_filter"
                    ]
                }
            },
            "filter": {
                "hyphen_filter": {
                    "type": "word_delimiter",
                    "preserve_original": "true",
                    "catenate_words": "true"
                }
            }
        }
    }
}

This would, for example, tokenize a word like anti-spam into antispam (hyphen removed), anti-spam (original preserved), anti, and spam.

Analyze API to see the generated tokens

POST /_analyze

{ "text": "anti-spam", "analyzer" : "my_analyzer" }

Output of the analyze API, i.e. the generated tokens

{
    "tokens": [
        {
            "token": "anti-spam",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "anti",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "antispam",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "spam",
            "start_offset": 5,
            "end_offset": 9,
            "type": "word",
            "position": 1
        }
    ]
}
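For anyone curious why exactly those four tokens appear, the word_delimiter behavior can be sketched in plain Python. This is only a rough emulation of the two flags I set, not the real Lucene implementation, and the token order may differ from the position-ordered ES response above:

```python
import re

def word_delimiter(token, preserve_original=True, catenate_words=True):
    # Split on non-alphanumeric characters, as word_delimiter does.
    parts = [p for p in re.split(r"[^a-zA-Z0-9]+", token) if p]
    out = []
    if preserve_original:
        out.append(token)           # "anti-spam" (preserve_original)
    out.extend(parts)               # "anti", "spam" (the sub-words)
    if catenate_words and len(parts) > 1:
        out.append("".join(parts))  # "antispam" (catenate_words)
    return out

print(word_delimiter("anti-spam"))
# ['anti-spam', 'anti', 'spam', 'antispam']
```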
Amit
  • 30,756
  • 6
  • 57
  • 88
Ram K
  • 1,746
  • 2
  • 14
  • 23
  • 1
    Fixed the formatting issues in your mapping (it was giving errors when I tried it) and properly code-formatted your answer :-) so that anybody can test and follow the example. – Amit Mar 05 '20 at 15:50