
I am building a document search. The main idea is that documents are read (using Apache Tika) and their text is added to the index to create a full-text document search.

A lot of the documents are quite large, and whenever I try to index them I get an error:

IllegalArgumentException[Document contains at least one immense term in field="<field>" (whose UTF8 encoding is longer than the max length 32766),

same as in this thread: UTF8 encoding is longer than the max length 32766

Other than limiting the actual String passed to Elasticsearch, another suggestion was to create a custom analyzer for that specific field. I am thus trying to create such an analyzer, but as I am quite new to ES, I can't quite figure out how. Sadly, the documentation doesn't help much here.

I don't need a specific analyzer (unless you have a good one for large strings), just some help on how to assign a custom analyzer to a specific field.


1 Answer


This was a while ago now, so I don't remember everything, but here goes.

The UTF8 encoding is longer than the max length 32766 issue I ran into was caused by a flag that had been set. It meant the string was not analyzed at all, so Elasticsearch treated the whole thing as one single term. Apache Lucene (the engine underneath Elasticsearch) has 32766 bytes as its maximum term length; if you index a single term longer than that, it throws this error.
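
For illustration, this is roughly what that difference looks like in a mapping. It is a minimal sketch, not the mapping from my project: the field name is a placeholder and the syntax is for recent Elasticsearch versions (older versions used string fields with "index": "not_analyzed"). An analyzed text field is split into tokens that each stay well below the limit, while a keyword sub-field with ignore_above skips values that are too long instead of failing; 8191 characters is the usual cap because a UTF-8 character can take up to 4 bytes.

```java
// Minimal mapping sketch (placeholder names, Elasticsearch 7+ syntax), sent
// as the body of a PUT to the index, like the request shown further down.
// "content" is analyzed as text, so individual tokens stay far below Lucene's
// 32766-byte term limit; the optional "content.raw" keyword sub-field uses
// "ignore_above" so over-long values are skipped instead of raising the error.
String mapping = """
    {
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "fields": {
              "raw": { "type": "keyword", "ignore_above": 8191 }
            }
          }
        }
      }
    }
    """;  // Java 15+ text block
```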

Writing a custom analyzer could definitely solve the issue, but having the default analyzer handle it was enough for my use case. By setting a certain flag (sort = false) in our own code, I was able to turn the default analyzer back on for the strings I send in.
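
And since the original question was how to assign a custom analyzer to a specific field, here is a minimal sketch of creating an index that does exactly that, using the plain REST API and the JDK HttpClient (Java 11+; the text block needs 15+). The index name documents, the field content, the analyzer name content_analyzer and the tokenizer/filter choice are all placeholders to adjust; the un-nested mappings syntax is for Elasticsearch 7+.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateIndexWithAnalyzer {
    public static void main(String[] args) throws Exception {
        // Define a custom analyzer ("content_analyzer") under settings.analysis
        // and assign it to the "content" field in the mapping. Swap the
        // tokenizer/filters for whatever your large strings need.
        String body = """
            {
              "settings": {
                "analysis": {
                  "analyzer": {
                    "content_analyzer": {
                      "type": "custom",
                      "tokenizer": "standard",
                      "filter": ["lowercase"]
                    }
                  }
                }
              },
              "mappings": {
                "properties": {
                  "content": {
                    "type": "text",
                    "analyzer": "content_analyzer"
                  }
                }
              }
            }
            """;

        // PUT the settings and mapping when creating the index.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/documents"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```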

Other experiences

You will run into faulty PDFs. A lot. These will cause issues in Apache Tika, such as zip bomb errors, which are often caused by deeply nested XML inside the PDF.
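
A minimal sketch of skipping such files instead of letting them abort the whole run (the 10 million character write limit is an arbitrary safety cap, not a value from my setup):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class SafeExtract {
    // Extracts plain text from a file, returning null for documents Tika
    // cannot handle (corrupt PDFs, zip bomb detection, etc.) so one bad
    // file does not kill the whole indexing run.
    static String extractText(Path file) {
        AutoDetectParser parser = new AutoDetectParser();
        // Cap the amount of extracted text so a pathological document
        // cannot blow up memory (10 million characters here, arbitrary).
        BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, new Metadata());
            return handler.toString();
        } catch (TikaException | SAXException | java.io.IOException e) {
            System.err.println("Skipping " + file + ": " + e.getMessage());
            return null;
        }
    }
}
```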

Also, don't underestimate the number of PDFs created using OCR. Although such a PDF might look fine when you open it, the extracted text can be complete nonsense. A quick way to check is to copy the text from the PDF into Notepad and see whether it makes sense.
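
If you want to automate that spot check, one rough heuristic (purely illustrative, not something I actually used) is to look at what fraction of the extracted text consists of ordinary characters:

```java
class TextSanityCheck {
    // Purely illustrative heuristic: treat extracted text as suspicious when
    // less than ~85% of it is letters, digits, whitespace or common punctuation.
    // The threshold is an arbitrary starting point and will need tuning.
    static boolean looksLikeGibberish(String text) {
        if (text == null || text.isBlank()) {
            return true;
        }
        long plausible = text.chars()
                .filter(c -> Character.isLetterOrDigit(c)
                        || Character.isWhitespace(c)
                        || ".,;:!?'\"()-".indexOf(c) >= 0)
                .count();
        return (double) plausible / text.length() < 0.85;
    }
}
```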

Prepare enough RAM for this. A single document could sometimes ramp up the program's RAM usage by 1-2 GB. How much of that was actually in use, rather than just waiting to be garbage-collected, I don't know.

Choose which files you actually want to parse. For example, there might be no useful reason to parse XML files.
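
For example, a simple extension-based filter (the extension list is only an example, not the one we used):

```java
import java.nio.file.Path;
import java.util.Set;

class FileFilter {
    // Only hand files to Tika when their type is worth full-text indexing.
    // Adjust the extension list to your own document set.
    static final Set<String> INDEXABLE_EXTENSIONS =
            Set.of("pdf", "doc", "docx", "odt", "rtf", "txt");

    static boolean shouldParse(Path file) {
        String name = file.getFileName().toString().toLowerCase();
        int dot = name.lastIndexOf('.');
        return dot >= 0 && INDEXABLE_EXTENSIONS.contains(name.substring(dot + 1));
    }
}
```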

Scanning a large number of documents takes a long time. It might be best to split the process into indexing and updating. That way, you can limit the number of documents scanned per day by checking whether a document has already been indexed: if it hasn't, index it; if it has changed, update it. In our case, scanning ~80000 documents took about 4 hours, with a single CPU and about 2 GB of RAM.
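
A sketch of that index-or-update decision (the indexedAt map and the commented-out helpers are stand-ins for however you track and write documents, e.g. a small database table or a timestamp stored with the document in the index):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.Map;

class IncrementalIndexer {
    // Decide per file whether to index, update or skip, based on when the
    // document was last indexed versus when the file was last modified.
    static void process(Path file, Map<Path, Instant> indexedAt) throws IOException {
        Instant modified = Files.getLastModifiedTime(file).toInstant();
        Instant lastIndexed = indexedAt.get(file);

        if (lastIndexed == null) {
            // Never seen before: parse and index it.
            // indexDocument(file);   // hypothetical helper
            indexedAt.put(file, Instant.now());
        } else if (modified.isAfter(lastIndexed)) {
            // Changed since the last run: re-parse and update the existing document.
            // updateDocument(file);  // hypothetical helper
            indexedAt.put(file, Instant.now());
        }
        // Otherwise: already indexed and unchanged, skip it.
    }
}
```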

Hope this helped even a little.
