
I am using the Nest client against Elasticsearch. I am using an n-gram index analyzer. I am noticing some odd behavior - when I search for words from the beginning I am not getting any results. However, if I search from the second character on, it works perfectly. These are just normal English letters.

So, for instance, it will find words containing 'kitty' if I search for 'itty', 'itt', 'tty', etc. but not 'ki', 'kit', etc. It's almost like n-gram is just skipping over the first character.

I am not sure if this is being caused by Nest or if this is normal behavior for n-gram. My index settings look similar to those found in this post: Elasticsearch using NEST: How to configure analyzers to find partial words? except my max-gram is only 10.

Update

I simplified my code a little bit and verified the same behavior.

Here is the mapping configuration defined using Nest:

const string index = "myApp";
const string type = "account";
const string indexAnalyzer = "custom_ngram_analyser";
const string searchAnalyzer = "standard";
const string tokenizer = "custom_ngram_tokenizer";
const string tokenFilter = "custom_ngram_tokenFilter";
...
client.CreateIndex(index, i => i
        // Index analyzer built on an n-gram tokenizer (the token filter is defined but not referenced here).
        .Analysis(ad => ad
            .Analyzers(a => a.Add(indexAnalyzer, new CustomAnalyzer() { Tokenizer = tokenizer }))
            .Tokenizers(t => t.Add(tokenizer, new NGramTokenizer() { MinGram = 1, MaxGram = 15 }))
            .TokenFilters(f => f.Add(tokenFilter, new NgramTokenFilter() { MinGram = 1, MaxGram = 15 })))
        .TypeName(type)
        .IdField(r => r.SetPath("accountId").SetIndex("not_analyzed").SetStored(true))
        // AccountName is analyzed with the n-gram analyzer at index time and the standard analyzer at search time.
        .Properties(ps => ps.Number(p => p.Name(r => r.AccountId)
                                          .Index(NonStringIndexOption.not_analyzed)
                                          .Store(true))
                            .String(p => p.Name(r => r.AccountName)
                                          .Index(FieldIndexOption.analyzed)
                                          .IndexAnalyzer(indexAnalyzer)
                                          .SearchAnalyzer(searchAnalyzer)
                                          .Store(true)
                                          .TermVector(TermVectorOption.no))));

And this is the search that returns nothing when the term starts at the first character:

SearchCriteria criteria = new SearchCriteria() { AccountName = "kitty" };

client.Search<SearchAccountResult>(s => s
    .Index(index)
    .Type(type)
    .Query(q => q.Bool(b => b.Must(d => d.Match(m => m.OnField(r => r.AccountName).QueryString(criteria.AccountName)))))
    .SortDescending("_score"));
Travis Parks
  • Hey Travis Parks, would you mind posting your mapping and search query? – Greg Marzouka May 31 '14 at 13:11
  • @Greg Marzouka I updated my question. – Travis Parks Jun 01 '14 at 01:03
  • hmm, seems to work fine for me. Just to clarify, with the same code above `itty` will return results? Can you post your mapping from ES by doing a GET /myapp/_mapping? Also, if you do GET /myapp/_analyze?analyzer=custom_ngram_analyser&text=kitty, is `kitty` a token? One other thing- I'm assuming this is just your example, but `myApp` isn't lowercase, which ES will reject when creating the index. – Greg Marzouka Jun 01 '14 at 13:08
  • Itty is returning results. – Travis Parks Jun 01 '14 at 15:46
  • I ran `GET /myapp/_analyze?analyzer=custom_ngram_analyser&text=kitty` and `kitty` does appear. – Travis Parks Jun 01 '14 at 15:58
  • I went to go get my current `_mapping`. It was missing the "lowercase" token filter on the index analyzer. A little research and the code building the index was outdated. I removed my project reference and re-added it. Rebuilt and now the correct token filter shows up. This seems to have resolved my issue as well because my old test data included mixed cased words. All my kitties were capitalized. – Travis Parks Jun 01 '14 at 18:11
  • I also noticed that if I built my test data and immediately queried the data, I got back no results. I would only see results the second time around. I am guessing there is a delay between when the data is added/indexed and when it is visible. I didn't think a response was returned until after the value was indexed and replicated... guess not. – Travis Parks Jun 01 '14 at 18:14
  • Hey @Travis Parks, glad you've figured out the issue. FYI- documents aren't available for search after indexing until a [refresh](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-refresh.html#indices-refresh) is issued. This can be set to occur on your index at a certain interval (default is every 1 second), but you can also manually issue a refresh with some slight performance implications. You can however retrieve a document immediately after indexing with the [get API](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-get.html). – Greg Marzouka Jun 02 '14 at 03:15
  • @GregMarzouka Knowing about `refresh` is actually really useful. I am going to be playing around with a lot of test data, so not having to constantly run my tests twice will keep me sane. – Travis Parks Jun 02 '14 at 12:18
  • @TravisParks mind answering your own question? We would love to keep the answered ratio going on the `nest` tag :) – Martijn Laarman Jul 02 '14 at 20:15

1 Answer


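I was running into this issue because initially my index was case sensitive. All of my test data started with upper case letters, so the n-grams that included the first character were indexed capitalized ('Ki', 'Kit'), while the standard search analyzer lowercased my query. That is why 'itty' matched but 'ki' and 'kit' did not.

I changed the index analyzer to be case-insensitive by adding the "lowercase" token filter, but the change did not take effect immediately. Even though the analyzer appeared to be configured as case-insensitive, the code building the index was outdated, and the documents already in the index had been analyzed with the old settings.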

Wiping out the index and repopulating it from scratch fixed the issue.
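
For anyone who wants to see the case mismatch without a cluster, here is a minimal sketch in plain C#. It does not call NEST or Elasticsearch; `NGramDemo`, `NGrams` and `Matches` are hypothetical helpers that only approximate what an n-gram tokenizer and a lowercasing search analyzer do to a single word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NGramDemo
{
    // Emit every substring of `text` whose length is between minGram and maxGram,
    // roughly what an n-gram tokenizer produces for a single word.
    // `lowercase` stands in for a "lowercase" token filter on the index analyzer.
    public static HashSet<string> NGrams(string text, int minGram, int maxGram, bool lowercase)
    {
        var source = lowercase ? text.ToLowerInvariant() : text;
        var grams = new HashSet<string>(StringComparer.Ordinal);
        for (int start = 0; start < source.Length; start++)
        {
            for (int len = minGram; len <= maxGram && start + len <= source.Length; len++)
            {
                grams.Add(source.Substring(start, len));
            }
        }
        return grams;
    }

    // Assumption: the search side lowercases the query (as the "standard" analyzer does)
    // and then needs an exact, case-sensitive term match among the indexed n-grams.
    public static bool Matches(HashSet<string> indexedGrams, string query)
    {
        return indexedGrams.Contains(query.ToLowerInvariant());
    }

    public static void Main()
    {
        var caseSensitive = NGrams("Kitty", 1, 15, lowercase: false);
        var caseInsensitive = NGrams("Kitty", 1, 15, lowercase: true);

        foreach (var query in new[] { "itty", "tty", "ki", "kit" })
        {
            Console.WriteLine("{0,-5} without lowercase: {1,-5} with lowercase: {2}",
                query, Matches(caseSensitive, query), Matches(caseInsensitive, query));
        }
    }
}
```

With the capitalized test word, only the n-grams that avoid the first letter are all lowercase, so 'itty' and 'tty' match either way while 'ki' and 'kit' match only once the indexed tokens are lowercased. Because the tokens are produced at index time, documents indexed before the filter was added keep their old tokens until they are indexed again.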

Travis Parks