I'm working on a membership administration program, for which we want to use Elasticsearch as the search engine. At this point we're having problems indexing certain fields, because they generate an 'immense term' error on the _all field.
Our settings:
curl -XGET 'http://localhost:9200/my_index?pretty=true'
{
  "my_index" : {
    "aliases" : { },
    "mappings" : {
      "Memberships" : {
        "_all" : {
          "analyzer" : "keylower"
        },
        "properties" : {
          "Amount" : {
            "type" : "float"
          },
          "Members" : {
            "type" : "nested",
            "properties" : {
              "Startdate membership" : {
                "type" : "date",
                "format" : "dateOptionalTime"
              },
              "Enddate membership" : {
                "type" : "date",
                "format" : "dateOptionalTime"
              },
              "Members" : {
                "type" : "string",
                "analyzer" : "keylower"
              }
            }
          },
          "Membership name" : {
            "type" : "string",
            "analyzer" : "keylower"
          },
          "Description" : {
            "type" : "string",
            "analyzer" : "keylower"
          },
          "elementId" : {
            "type" : "integer"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1441310632366",
        "number_of_shards" : "1",
        "analysis" : {
          "filter" : {
            "my_char_filter" : {
              "type" : "asciifolding",
              "preserve_original" : "true"
            }
          },
          "analyzer" : {
            "keylower" : {
              "filter" : [ "lowercase", "my_char_filter" ],
              "tokenizer" : "keyword"
            }
          }
        },
        "number_of_replicas" : "1",
        "version" : {
          "created" : "1040599"
        },
        "uuid" : "nn16-9cTQ7Gn9NMBlFxHsw"
      }
    },
    "warmers" : { }
  }
}
We use the keylower analyzer because we don't want full names to be split on whitespace: we want a search for 'john johnson' to match in the _all field as well as in the 'Members' field.
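For example, running a name through the keylower analyzer with the _analyze API shows that it is kept as a single lowercased token (the output below is roughly what I get):

curl -XGET 'http://localhost:9200/my_index/_analyze?analyzer=keylower' -d 'John Johnson'
{
  "tokens" : [ {
    "token" : "john johnson",
    "start_offset" : 0,
    "end_offset" : 12,
    "type" : "word",
    "position" : 1
  } ]
}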
The 'Members' field can contain multiple members, which is where the problems start. When the field contains only a couple of members (as in the example below), there is no problem. However, the field may contain hundreds or thousands of members, and that is when we get the immense term error.
curl 'http://localhost:9200/my_index/_search?pretty=true&q=*:*'
{
  "took":1,
  "timed_out":false,
  "_shards":{
    "total":1,
    "successful":1,
    "failed":0
  },
  "hits":{
    "total":1,
    "max_score":1.0,
    "hits":[
      {
        "_index":"my_index",
        "_type":"Memberships",
        "_id":"15",
        "_score":1.0,
        "_source":{
          "elementId":[
            "15"
          ],
          "Membership name":[
            "My membership"
          ],
          "Amount":[
            "100"
          ],
          "Description":[
            "This is the description."
          ],
          "Members":[
            {
              "Members":"John Johnson",
              "Startdate membership":"2015-01-09",
              "Enddate membership":"2015-09-03"
            },
            {
              "Members":"Pete Peterson",
              "Startdate membership":"2015-09-09"
            },
            {
              "Members":"Santa Claus",
              "Startdate membership":"2015-09-16"
            }
          ]
        }
      }
    ]
  }
}
NOTE: The above example works! It's only when the 'Members' field contains many more members that we get the error. The error we get is:
"error":"IllegalArgumentException[Document contains at least one immense term in field=\"_all\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[...]...', original message: bytes can be at most 32766 in length; got 106807]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 106807]; " "status":500
We only get this error on the _all field, not on the original Members field. With ignore_above it's no longer possible to search the _all field on full name. With the standard analyzer, I would find this document when searching for 'Santa Johnson', because the _all field would then contain the separate tokens 'santa' and 'johnson'. That's why I use keylower for these fields.
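To illustrate that problem, the standard analyzer splits a name into separate tokens (output roughly as follows), so first and last names of different members can be combined in one match:

curl -XGET 'http://localhost:9200/_analyze?analyzer=standard' -d 'John Johnson'
{
  "tokens" : [ {
    "token" : "john",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "johnson",
    "start_offset" : 5,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}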
What I would like is an analyzer that tokenizes per field, but doesn't break up the values within the fields themselves. What happens now is that the entire 'Members' field is fed to _all as one token, including the child fields. So the single token in the example above would be:
- John Johnson 2015-01-09 2015-09-03 Pete Peterson 2015-09-09 Santa Claus 2015-09-16
Is it possible to tokenize these fields in such a way that every field value is fed to _all as a separate token, without breaking up the values themselves? The tokens would then be:
- John Johnson
- 2015-01-09
- 2015-09-03
- Pete Peterson
- 2015-09-09
- Santa Claus
- 2015-09-16
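One direction I've considered is to exclude the nested field from _all and copy the names into a separate root-level field instead. Below is an untested sketch (the index name my_index_v2 and the field name fullnames are made up, and I don't know whether copy_to behaves this way across nested boundaries in 1.x):

curl -XPUT 'http://localhost:9200/my_index_v2' -d '{
  "settings" : {
    "analysis" : {
      "filter" : {
        "my_char_filter" : {
          "type" : "asciifolding",
          "preserve_original" : "true"
        }
      },
      "analyzer" : {
        "keylower" : {
          "tokenizer" : "keyword",
          "filter" : [ "lowercase", "my_char_filter" ]
        }
      }
    }
  },
  "mappings" : {
    "Memberships" : {
      "_all" : {
        "analyzer" : "keylower"
      },
      "properties" : {
        "Members" : {
          "type" : "nested",
          "include_in_all" : false,
          "properties" : {
            "Members" : {
              "type" : "string",
              "analyzer" : "keylower",
              "copy_to" : "fullnames"
            },
            "Startdate membership" : {
              "type" : "date",
              "format" : "dateOptionalTime"
            },
            "Enddate membership" : {
              "type" : "date",
              "format" : "dateOptionalTime"
            }
          }
        },
        "fullnames" : {
          "type" : "string",
          "analyzer" : "keylower"
        }
      }
    }
  }
}'

The idea is that each name would then be indexed as its own keylower token in fullnames rather than being glued into one immense _all term, but I'd still like to know whether _all itself can be made to work this way.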
Note: we use the Elasticsearch PHP library.