68

I've upgraded my Elasticsearch cluster from 1.1 to 1.2, and I now get errors when indexing a fairly large string.

{
  "error": "IllegalArgumentException[Document contains at least one immense term in field=\"response_body\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[7b 22 58 48 49 5f 48 6f 74 65 6c 41 76 61 69 6c 52 53 22 3a 7b 22 6d 73 67 56 65 72 73 69]...']",
  "status": 500
}

The mapping of the index:

{
  "template": "partner_requests-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "request": {
      "properties": {
        "asn_id": { "index": "not_analyzed", "type": "string" },
        "search_id": { "index": "not_analyzed", "type": "string" },
        "partner": { "index": "not_analyzed", "type": "string" },
        "start": { "type": "date" },
        "duration": { "type": "float" },
        "request_method": { "index": "not_analyzed", "type": "string" },
        "request_url": { "index": "not_analyzed", "type": "string" },
        "request_body": { "index": "not_analyzed", "type": "string" },
        "response_status": { "type": "integer" },
        "response_body": { "index": "not_analyzed", "type": "string" }
      }
    }
  }
}

I've searched the documentation and didn't find anything related to a maximum field size. Based on the core types section, I don't understand why I should "correct the analyzer" for a not_analyzed field.

jlecour

10 Answers

69

So you are running into an issue with the maximum size for a single term. When you set a field to not_analyzed, it is treated as one single term, and the maximum size for a single term in the underlying Lucene index is 32766 bytes, which I believe is hard-coded.

Your two primary options are to either change the type to binary, or to keep using string but set the index type to "no".
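
For example (a rough sketch against the response_body field from the question's template, in ES 1.x mapping syntax; not tested):

    "response_body": { "type": "binary" }

or

    "response_body": { "type": "string", "index": "no" }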

falstro
John Petrone
  • That's exactly the conclusion Karmi (from the Elasticsearch core team) helped me arrive at. As I've said, I've settled for `"index": "no"`. – jlecour Jun 03 '14 at 19:09
  • FYI, this is something that recently changed in the underlying Lucene, which previously silently ignored those immense terms but now throws an exception. – javanna Jun 05 '14 at 20:41
  • Note that on ES 5.4.1 you still have to disable doc_values as well. – prikha Jun 23 '17 at 11:07
34

If you really want not_analyzed on the property because you want to do some exact filtering, then you can use "ignore_above": 256.

Here is an example of how I use it in PHP:

    'mapping'    => [
        'type'   => 'multi_field',
        'path'   => 'full',
        'fields' => [
            '{name}' => [
                'type'     => 'string',
                'index'    => 'analyzed',
                'analyzer' => 'standard',
            ],
            'raw' => [
                'type'         => 'string',
                'index'        => 'not_analyzed',
                'ignore_above' => 256,
            ],
        ],
    ],

In your case you probably want to do as John Petrone suggested and set "index": "no", but for anyone else who finds this question by searching on that exception, as I did, your options are:

  • set "index": "no"
  • set "index": "analyzed"
  • set "index": "not_analyzed" and "ignore_above": 256

It depends on if and how you want to filter on that property.
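
For reference, here is the third option expressed directly as ES 1.x mapping JSON rather than PHP (a sketch against the question's response_body field; the "raw" sub-field name is just a common convention):

    "response_body": {
      "type": "multi_field",
      "fields": {
        "response_body": { "type": "string", "index": "analyzed", "analyzer": "standard" },
        "raw": { "type": "string", "index": "not_analyzed", "ignore_above": 256 }
      }
    }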

Mikael M
  • This is the best solution if you still need to sort on the field. – Tjorriemorrie Sep 22 '15 at 11:37
  • Worth mentioning: if data already exists in the type, it's best to use the `ignore_above` option, which is a non-breaking change, as opposed to `index: no`, which is a breaking change and would force you to export and re-import the data. – Ean V Jul 13 '16 at 04:45
10

There is a better option than the one John posted, because with that solution you can't search on the value anymore.

Back to the problem:

The problem is that, by default, the field value is used as a single term (the complete string). If that term/string is longer than 32766 bytes, it can't be stored in Lucene.

Older versions of Lucene only registered a warning when terms were too long (and ignored the value); newer versions throw an exception. See the bugfix: https://issues.apache.org/jira/browse/LUCENE-5472

Solution:

The best option is to define a (custom) analyzer on the field with the long string value. The analyzer can split the long string into smaller strings/terms, which fixes the problem of overly long terms.

Don't forget to also add an analyzer to the "_all" field if you are using that functionality.

Analyzers can be tested with the REST API: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html
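
A rough sketch of what this could look like for the question's response_body field (the index name, analyzer name and filter chain below are only examples, not something prescribed by Elasticsearch):

    curl -XPUT 'localhost:9200/partner_requests-test' -d '{
      "settings": {
        "analysis": {
          "analyzer": {
            "body_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase"]
            }
          }
        }
      },
      "mappings": {
        "request": {
          "_all": { "analyzer": "body_analyzer" },
          "properties": {
            "response_body": { "type": "string", "analyzer": "body_analyzer" }
          }
        }
      }
    }'

You can then check which terms it produces with the analyze endpoint mentioned above, e.g. curl 'localhost:9200/partner_requests-test/_analyze?analyzer=body_analyzer' -d 'some long text'.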

Jasper Huzen
  • Very interesting answer, but where do you find such a custom analyzer, one that splits the string field into tokens yet "combines" those tokens back into the whole value during a search request? – Cherry May 05 '15 at 04:08
  • You can use the default/custom analyzers described at http://www.elastic.co/guide/en/elasticsearch/reference/1.5/analysis-analyzers.html . Don't forget that the _all field can use/require an analyzer too (or disable that field to fix the problem). – Jasper Huzen May 06 '15 at 07:29
2

I needed to change the index part of the mapping to no instead of not_analyzed. That way the value is not indexed. It remains available in the returned document (from a search, a get, …) but I can't query it.
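
In terms of the mapping from the question, that amounts to something like this (sketched here for the response_body field named in the error):

    "response_body": { "index": "no", "type": "string" }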

jlecour
  • @Adrian As I've said, I've changed `mapping.request.properties.request_body.index` from `not_analyzed` to `no` and updated the mapping of my index in Elasticsearch. – jlecour Jul 24 '14 at 10:38
2

One way of handling tokens that are over the Lucene limit is to use the truncate token filter, similar to ignore_above for keywords. To demonstrate, I'm using a length of 5. Elasticsearch suggests using ignore_above = 32766 / 4 = 8191, since UTF-8 characters may occupy at most 4 bytes: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/ignore-above.html

curl -H'Content-Type:application/json' localhost:9200/_analyze -d'{
  "filter" : [{"type": "truncate", "length": 5}],
  "tokenizer": {
    "type":    "pattern"
  },
  "text": "This movie \n= AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}'

Output:

{
  "tokens": [
    {
      "token": "This",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "movie",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "AAAAA",
      "start_offset": 14,
      "end_offset": 52,
      "type": "word",
      "position": 2
    }
  ]
}
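
To apply the same filter at index time rather than only in _analyze, it can be declared in the index settings and attached to a custom analyzer, roughly like this (the index, analyzer and field names here are made up for the example, and the length of 8191 follows the ignore_above guidance above):

    curl -H'Content-Type:application/json' -XPUT localhost:9200/my-index -d'{
      "settings": {
        "analysis": {
          "filter": {
            "term_truncate": { "type": "truncate", "length": 8191 }
          },
          "analyzer": {
            "truncating_analyzer": {
              "type": "custom",
              "tokenizer": "pattern",
              "filter": ["term_truncate", "lowercase"]
            }
          }
        }
      },
      "mappings": {
        "_doc": {
          "properties": {
            "message": { "type": "text", "analyzer": "truncating_analyzer" }
          }
        }
      }
    }'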
Brandon Kearby
1

If you are using Searchkick, upgrade Elasticsearch to >= 2.2.0 and make sure you are using Searchkick 1.3.4 or later.

This version of Searchkick sets ignore_above = 256 by default, so you won't get this error when the UTF-8 encoding of a term is longer than 32766 bytes.

This is discussed here.

Jeremy Lynch
1

Using Logstash to index those long messages, I use this filter to truncate the long string:

    filter {
        # store the original message size in bytes
        ruby {
            code => "event.set('message_size',event.get('message').bytesize) if event.get('message')"
        }
        # truncate and tag messages whose size exceeds 32000 bytes
        ruby {
            code => "
                if (event.get('message_size'))
                    event.set('message', event.get('message')[0..9999]) if event.get('message_size') > 32000
                    event.tag 'long message' if event.get('message_size') > 32000
                end
            "
        }
    }

It adds a message_size field so that I can sort the longest messages by size.

It also adds the long message tag to those that are over 32000 bytes so I can select them easily.

It doesn't solve the problem if you intend to index those long messages completely, but if, like me, you don't want to have them in Elasticsearch in the first place and just want to track them so you can fix them, it's a working solution.

Val F.
0

I got around this problem by changing my analyzer:

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "standard" : {
                    "tokenizer": "standard",
                    "filter": ["standard", "lowercase", "stop"]
                }
            }
        }
    }
}
Raghu K Nair
0

In Solr v6+ I changed the type of the body field from string to text_general and it solved my problem:

<field name="body" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
Shiladitya
sajju
0

I've stumbled upon the same error message with Drupal's Search API attachments module:

Document contains at least one immense term in field="saa_saa_file_entity" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.

Changing the field's type from string to Fulltext (in /admin/config/search/search-api/index/elastic_index/fields) solved the problem for me.