2

I'm working on a membership administration program for which we want to use Elasticsearch as the search engine. At this point we're having problems indexing certain fields, because they generate an 'immense term' error on the _all field.

Our settings:

curl -XGET 'http://localhost:9200/my_index?pretty=true'
{
  "my_index" : {
    "aliases" : { },
    "mappings" : {
      "Memberships" : {
        "_all" : {
          "analyzer" : "keylower"
        },
        "properties" : {
          "Amount" : {
            "type" : "float"
          },
          "Members" : {
            "type" : "nested",
            "properties" : {
              "Startdate membership" : {
                "type" : "date",
                "format" : "dateOptionalTime"
              },
              "Enddate membership" : {
                "type" : "date",
                "format" : "dateOptionalTime"
              },
              "Members" : {
                "type" : "string",
                "analyzer" : "keylower"
              }
            }
          },
          "Membership name" : {
            "type" : "string",
            "analyzer" : "keylower"
          },
          "Description" : {
            "type" : "string",
            "analyzer" : "keylower"
          },
          "elementId" : {
            "type" : "integer"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1441310632366",
        "number_of_shards" : "1",
        "analysis" : {
          "filter" : {
            "my_char_filter" : {
              "type" : "asciifolding",
              "preserve_original" : "true"
            }
          },
          "analyzer" : {
            "keylower" : {
              "filter" : [ "lowercase", "my_char_filter" ],
              "tokenizer" : "keyword"
            }
          }
        },
        "number_of_replicas" : "1",
        "version" : {
          "created" : "1040599"
        },
        "uuid" : "nn16-9cTQ7Gn9NMBlFxHsw"
      }
    },
    "warmers" : { }
  }
}

We use the keylower analyzer because we don't want the full name to be split on whitespace: we want to be able to search for 'john johnson' in the _all field as well as in the 'Members' field.

The 'Members' field can contain multiple members, which is where the problems start. When the field contains only a couple of members (as in the example below), there is no problem. However, the field may contain hundreds or thousands of members, which is when we get the immense term error.
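To see what the analyzer does with a single value, the _analyze API can be used (the query-parameter syntax below is the Elasticsearch 1.x form; newer versions take a JSON body instead). With keylower, the whole input should come back as one lowercased token:

```shell
# Run the custom keylower analyzer from my_index over a sample value.
# keyword tokenizer + lowercase filter => one token for the whole input.
curl 'http://localhost:9200/my_index/_analyze?analyzer=keylower&pretty=true' \
     -d 'John Johnson'
# should return the single token "john johnson"
```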

curl 'http://localhost:9200/my_index/_search?pretty=true&q=*:*'
{  
   "took":1,
   "timed_out":false,
   "_shards":{  
      "total":1,
      "successful":1,
      "failed":0
   },
   "hits":{  
      "total":1,
      "max_score":1.0,
      "hits":[  
         {  
            "_index":"my_index",
            "_type":"Memberships",
            "_id":"15",
            "_score":1.0,
            "_source":{  
               "elementId":[  
                  "15"
               ],
               "Membership name":[  
                  "My membership"
               ],
               "Amount":[  
                  "100"
               ],
               "Description":[  
                  "This is the description."
               ],
               "Members":[  
                  {  
                     "Members":"John Johnson",
                     "Startdate membership":"2015-01-09",
                     "Enddate membership":"2015-09-03"
                  },
                  {  
                     "Members":"Pete Peterson",
                     "Startdate membership":"2015-09-09"
                  },
                  {  
                     "Members":"Santa Claus",
                     "Startdate membership":"2015-09-16"
                  }
               ]
            }
         }
      ]
   }
}

NOTE: The above example works! It's only when the 'Members' field contains many more members that we get the error. The error we get is:

"error":"IllegalArgumentException[Document contains at least one immense term in field=\"_all\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[...]...', original message: bytes can be at most 32766 in length; got 106807]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 106807]; " "status":500

We only get this error on the _all field, not on the original Members field. With ignore_above, it's no longer possible to search the _all field by full name. With the standard analyzer, I would find this document when searching for 'Santa Johnson', because the _all field then has a token 'Santa' and a token 'Johnson'. That's why I use keylower for these fields.
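The difference is easy to check with the _analyze API (again in the 1.x query-parameter syntax). The standard analyzer splits on whitespace and punctuation, so every first and last name ends up as its own token in _all, and a search for 'Santa Johnson' matches tokens from two different members:

```shell
# The standard analyzer tokenizes and lowercases word by word, so the
# combined member values produce one token per name part.
curl 'http://localhost:9200/_analyze?analyzer=standard&pretty=true' \
     -d 'John Johnson Santa Claus'
# should return the tokens: john, johnson, santa, claus
```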

What I would like is an analyzer that tokenizes per field, but doesn't break up the values within each field. What happens now is that the entire 'Members' field is fed to _all as one token, including the child fields. So the token in the example above would be:

  • John Johnson 2015-01-09 2015-09-03 Pete Peterson 2015-09-09 Santa Claus 2015-09-16

Is it possible to tokenize these fields in such a way that every field value is fed to _all as a separate token, without breaking up the values in the fields themselves? So that the tokens would be:

  • John Johnson
  • 2015-01-09
  • 2015-09-03
  • Pete Peterson
  • 2015-09-09
  • Santa Claus
  • 2015-09-16

Note: We use the Elasticsearch PHP library.

wjhulzebosch
  • [This answer](http://stackoverflow.com/questions/24019868/utf8-encoding-is-longer-than-the-max-length-32766) should help you. – Val Sep 03 '15 at 20:26
  • Just want to add to what @Val said. There are better-performing alternatives to treating the `_all` field as a single non-tokenized blob and using very expensive substring search. For example, you can use the standard analyzer on the _all field and then search for the phrase "John Johnson". Or you can index _all using a shingle filter and search for 2- or 3-word phrases with a normal match query. Perhaps, if you can update your question with more complete requirements, we would be able to suggest a better and more performant solution. – imotov Sep 03 '15 at 21:16
  • @imotov I've edited the question to include our reasons for not using the default analyzer and ignore_above. – wjhulzebosch Sep 04 '15 at 14:58
  • @wjhulzebosch you didn't answer why you cannot use standard analyzer and [phrase search](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html#_phrase) – imotov Sep 04 '15 at 15:44
  • @imotov If my understanding of the match/phrase query is correct, I did: I don't want to find this document when searching for 'Santa Johnson', but when I use the default analyzer and then perform this search, this document will pop up, because it has a token 'Santa' and a token 'Johnson'. Maybe I'm wrong on this point? – wjhulzebosch Sep 04 '15 at 16:10

1 Answer

3

There is a much better way of doing this. Whether or not a phrase search can span multiple field values is determined by position_offset_gap (in 2.0 it will be renamed to position_increment_gap). This parameter basically specifies how many words/positions should be "inserted" between the last token of one field value and the first token of the following value. By default, in Elasticsearch prior to 2.0, position_offset_gap has a value of 0, which is what causes the issues you describe.

By combining the copy_to feature with position_offset_gap you can create an alternative my_all field that doesn't have this issue. By setting this new field as index.query.default_field you can tell Elasticsearch to use it by default instead of the _all field when no fields are specified.

curl -XDELETE "localhost:9200/test-idx?pretty"
curl -XPUT "localhost:9200/test-idx?pretty" -d '{
    "settings" :{
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "query.default_field": "my_all"
        }
    },
    "mappings": {
        "doc": {
            "_all" : {
                "enabled" : false
            },
            "properties": {
                "Members" : {
                  "type" : "nested",
                  "properties" : {
                    "Startdate membership" : {
                      "type" : "date",
                      "format" : "dateOptionalTime",
                      "copy_to": "my_all"
                    },
                    "Enddate membership" : {
                      "type" : "date",
                      "format" : "dateOptionalTime",
                      "copy_to": "my_all"
                    },
                    "Members" : {
                      "type" : "string",
                      "analyzer" : "standard",
                      "copy_to": "my_all"
                    }
                  }
                },
                "my_all" : {
                    "type": "string",
                    "position_offset_gap": 256
                }
            }
        }
    }
}'
curl -XPUT "localhost:9200/test-idx/doc/1?pretty" -d '{
    "Members": [{
        "Members": "John Johnson",
        "Startdate membership": "2015-01-09",
        "Enddate membership": "2015-09-03"
    }, {
        "Members": "Pete Peterson",
        "Startdate membership": "2015-09-09"
    }, {
        "Members": "Santa Claus",
        "Startdate membership": "2015-09-16"
    }]
}'
curl -XPOST "localhost:9200/test-idx/_refresh?pretty"
echo
echo "Should return one hit"
curl "localhost:9200/test-idx/doc/_search?pretty=true" -d '{
    "query": {
        "match_phrase" : {
            "my_all" : "John Johnson"
        }
    }
}'
echo
echo "Should return one hit"
curl "localhost:9200/test-idx/doc/_search?pretty=true" -d '{
    "query": {
        "query_string" : {
            "query" : "\"John Johnson\""
        }
    }
}'
echo
echo "Should return no hits"
curl "localhost:9200/test-idx/doc/_search?pretty=true" -d '{
    "query": {
        "match_phrase" : {
            "my_all" : "Johnson 2015-01-09"
        }
    }
}'
echo
echo "Should return no hits"
curl "localhost:9200/test-idx/doc/_search?pretty=true" -d '{
    "query": {
        "query_string" : {
            "query" : "\"Johnson 2015-01-09\""
        }
    }
}'
echo
echo "Should return no hits"
curl "localhost:9200/test-idx/doc/_search?pretty=true" -d '{
    "query": {
        "match_phrase" : {
            "my_all" : "Johnson Pete"
        }
    }
}'
imotov
  • Hi, I've just tested this and it seems to work correctly. The only problem I've encountered is that I can't use match_phrase with a wildcard (for instance, when I want to search for "Santa Cl*"). I know this is actually a separate issue, but for us it's important that this is possible. I haven't found a combination of match_phrase with a wildcard. Is this possible, or should I use nGrams for this? – wjhulzebosch Sep 08 '15 at 19:05
  • I think what you are looking for here is [match_phrase_prefix](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html#_match_phrase_prefix). There are some caveats there with massive wildcard expansions, but that's definitely too juicy to discuss in comments. – imotov Sep 08 '15 at 19:22
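For reference, the match_phrase_prefix query mentioned in the last comment would look roughly like this against the my_all field from the answer; max_expansions caps how many terms the trailing prefix is expanded to, which is the caveat about massive wildcard expansions:

```shell
# Phrase search where the last word is treated as a prefix ("Cl" matches
# "Claus"); max_expansions limits the number of candidate terms examined.
curl "localhost:9200/test-idx/doc/_search?pretty=true" -d '{
    "query": {
        "match_phrase_prefix" : {
            "my_all" : {
                "query" : "Santa Cl",
                "max_expansions" : 50
            }
        }
    }
}'
```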