
I am using Elasticsearch with the river plugin against a CouchDB, and I am trying to use nGrams for my queries. I have basically everything I need working, except that when someone types a space the query stops behaving as expected. That is because Elasticsearch tokenizes the query, splitting it on the space.

Here is what I need to do:

  • Query for a partial match within a string:

    query: "Hello Wor" response: "Hello World, Hello Word" / excluded "Hello, World, Word"

  • Sort results by criteria I specify;

  • Case-insensitive matching.

Here is what I have done, following this question: How to search for a part of a word with ElasticSearch

curl -X PUT 'localhost:9200/_river/myDB/_meta' -d '
{
    "type" : "couchdb",
    "couchdb" : {
        "host" : "localhost",
        "port" : 5984,
        "db" : "myDB",
        "filter" : null
    },
    "index" : {
        "index" : "myDB",
        "type" : "myDB",
        "bulk_size" : "100",
        "bulk_timeout" : "10ms",
        "analysis" : {
            "index_analyzer" : {
                "my_index_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "mynGram"]
                }
            },
            "search_analyzer" : {
                "my_search_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "mynGram"]
                }
            },
            "filter" : {
                "mynGram" : {
                    "type" : "nGram",
                    "min_gram" : 2,
                    "max_gram" : 50
                }
            }
        }
    }
}
'
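
To double-check how a query string actually gets tokenized, the _analyze API can be run against the index. A minimal sketch, assuming the index exists and my_search_analyzer is registered as above; the response lists the tokens the text is split into:

# show the tokens produced for the query text by the custom search analyzer
curl 'localhost:9200/myDB/_analyze?analyzer=my_search_analyzer&pretty=true' -d 'Hello Wor'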

I then add a mapping for the sorting. Here is what the resulting mapping looks like:

curl -s -XGET 'localhost:9200/myDB/myDB/_mapping'
{
    "sorting": {
        "Title": {
            "fields": {
                "Title": {
                    "type": "string"
                },
                "untouched": {
                    "include_in_all": false,
                    "index": "not_analyzed",
                    "type": "string"
                }
            },
            "type": "multi_field"
        },
        "Year": {
            "fields": {
                "Year": {
                    "type": "string"
                },
                "untouched": {
                    "include_in_all": false,
                    "index": "not_analyzed",
                    "type": "string"
                }
            },
            "type": "multi_field"
        }
    }
}
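
For the sorting requirement, the not_analyzed "untouched" sub-fields are what I point the sort at. A minimal sketch of a search body that sorts on them (my own illustration rather than part of the setup above; the query is just a placeholder):

curl -XPOST 'localhost:9200/myDB/myDB/_search?pretty=true' -d '
{
    "query" : { "query_string" : { "query" : "Title:(Hello Wor)" } },
    "sort" : [
        { "Year.untouched" : "asc" },
        { "Title.untouched" : "asc" }
    ]
}'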

I have included all this information just to be complete. Anyway, with this setup, which I would expect to work, the space is still used to split my query whenever I try to get results. Example:

  http://localhost:9200/myDB/myDB/_search?q=Title:(Hello%20Wor)&pretty=true

This returns anything that contains "Hello" and anything that contains "Wor". (I normally don't use the parentheses, but I have seen them in an example; either way the results are very similar.)

Any help is truly appreciated as this is bugging me quite a lot.

UPDATE: In the end, I realized that I didn't need nGrams at all. A normal index is enough; simply replacing the whitespace in the query with ' AND ' does the job.

Example:

 Query: "Hello World"  --->  Replaced as "(*Hello And World*)"
1 Answer


I don't have an Elasticsearch setup at hand right now, but maybe this helps, from the docs:

http://www.elasticsearch.org/guide/reference/query-dsl/match-query.html

Types of Match Queries

boolean

The default match query is of type boolean. That means the provided text is analyzed and the analysis process constructs a boolean query from it. The operator flag can be set to "or" or "and" to control the boolean clauses (it defaults to "or").

The analyzer can be set to control which analyzer performs the analysis on the text. It defaults to the field's explicit mapping definition, or to the default search analyzer.

fuzziness can be set to a value (depending on the relevant type; for string types it should be a value between 0.0 and 1.0) to construct fuzzy queries for each analyzed term. The prefix_length and max_expansions can be set in this case to control the fuzzy process. If the fuzzy option is set, the query will use constant_score_rewrite as its rewrite method; the rewrite parameter allows control over how the query gets rewritten.

Here is an example of providing additional parameters (note the slight change in structure; message is the field name):

{
    "match" : {
        "message" : {
            "query" : "this is a test",
            "operator" : "and"
        }
    }
}
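
Applied to the question, the same structure with operator set to and would look roughly like the sketch below (Title is the field name from the question; whether partial words like "Wor" match still depends on the analyzer setup):

{
    "match" : {
        "Title" : {
            "query" : "Hello Wor",
            "operator" : "and"
        }
    }
}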