I am testing Elasticsearch by indexing Wikipedia topic names.

My settings are below.

I expect the first result to be the exact match for the query string, especially when the query is a single word.

Instead:

Searching for "g"

curl "http://localhost:9200/my_index/_search?q=name:g&pretty=True"

returns [Changgyeonggung, Lopadotemachoselachogaleokranioleipsanodrimhypotrimmatosilphioparaomelitokatakechymenokichlepikossyphophattoperisteralektryonoptekephalliokigklopeleiolagoiosiraiobaphetraganopterygon, ..] as the first results. (Yes, serendipity time! That is a Greek dish, if you are curious [http://nifty.works/about/BgdKMmwV6B3r4pXJ/] :)

I thought this was because these results contain more "g" letters than other words, but:

Searching for "google":

curl "http://localhost:9200/my_index/_search?q=name:google&pretty=True"

returns

[Googlewhack, IGoogle, Google+, Google, ..] as the first results, whereas I would expect Google to be first.

What is wrong in my settings that prevents the exact keyword from ranking first when it exists?

I used separate index and search analyzers for the reason suggested in this answer: [https://stackoverflow.com/a/15932838/305883]

Settings

# make index with mapping
curl -X PUT localhost:9200/test-ngram -d '
{
  "settings": {
      "analysis": {
          "analyzer": {
              "index_analyzer": {
                  "type" : "custom",
                  "tokenizer": "lowercase",
                  "filter": ["asciifolding", "title_ngram"]
              },
              "search_analyzer": {
                  "type": "custom",
                  "tokenizer": "standard",
                  "filter": ["standard", "lowercase", "stop", "asciifolding"]
              }
          },
          "filter": {
              "title_ngram": {
                  "type": "nGram",
                  "min_gram": 1,
                  "max_gram": 10
              }
          }
      }
  },

  "mappings": {
    "topic": {
      "properties": {
        "name": {
          "type": "string",
          "boost": 10.0,
          "index": "analyzed",
          "index_analyzer": "index_analyzer",
          "search_analyzer": "search_analyzer"
        }
      }
    }
  }
}
'
user305883

1 Answer

That's because relevance scoring works differently by default (see the part about TF/IDF: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html). If you want the exact term match at the top of the results while still matching substrings, you need to index name as a multi-field, like this:

"name": {
    "type": "string",
    "index": "analyzed",
    // other analyzer stuff here
    "fields": {
        "raw":   { "type": "string", "index": "not_analyzed" }
    }
}

Then, in a bool query, query both name and name.raw, and boost the results that match name.raw.
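As a rough sketch of what such a request body could look like, the Python snippet below builds a bool query with two should clauses: a match on name and a boosted term on name.raw. The helper name build_exact_first_query and the default boost of 10.0 are illustrative assumptions, not values from the question. Since name.raw is not_analyzed, the term clause only matches the stored string exactly, including case, so "google" would not hit "Google".

```python
import json


def build_exact_first_query(text, exact_boost=10.0):
    """Build a search body that favors exact matches on name.raw.

    Hypothetical helper: the function name and the default boost
    are assumptions made for this illustration.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Analyzed field: matches via the ngram tokens.
                    {"match": {"name": text}},
                    # not_analyzed sub-field: exact, case-sensitive match,
                    # boosted so it ranks above partial matches.
                    {"term": {"name.raw": {"value": text, "boost": exact_boost}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body; it could be sent with curl -d to the _search endpoint.
    print(json.dumps(build_exact_first_query("Google"), indent=2))
```

The boost value probably needs tuning against real data: it has to be large enough to outweigh the TF/IDF score that long, ngram-rich names collect from the match clause.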

xeye
  • Thank you for the link! So does this solution actually create two indexes, one `analyzed` and the other `not_analyzed`? (I am trying to keep the index size to a minimum: 4M strings <= 0.5 GB. With the standard analyzer I get 0.3 GB, with ngram 1.3 GB. I also need to match words in the middle of strings.) I am now testing fuzzy search [https://www.elastic.co/blog/found-fuzzy-search]. – user305883 Jul 03 '16 at 19:38
  • Yep, you'll have two inverted indexes. – xeye Jul 03 '16 at 19:54