This was a while ago now, so I don't remember everything, but here goes.
The "UTF8 encoding is longer than the max length 32766" issue I ran into was caused by a flag that had been set. This caused the string to not be analyzed at all, so ElasticSearch treated the entire value as one single term. Apache Lucene (the engine underneath ElasticSearch) enforces 32766 bytes as the maximum term length; if you index a single term longer than that, it throws this error.
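To make the limit concrete, here is a minimal sketch (plain Java, no Lucene dependency) of the kind of check you could do before sending a value in as a single, non-analyzed term. Note that the 32766 figure is a limit on UTF-8 bytes, not characters:

```java
import java.nio.charset.StandardCharsets;

public class TermLengthCheck {

    // Lucene rejects any single term whose UTF-8 encoding exceeds 32766 bytes.
    private static final int MAX_TERM_BYTES = 32766;

    /** Returns true if the value is safe to index as one non-analyzed term. */
    static boolean fitsInSingleTerm(String value) {
        // The limit is in bytes, so multi-byte characters count more than once.
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_BYTES;
    }

    public static void main(String[] args) {
        String huge = "x".repeat(40_000); // e.g. the full text of a parsed PDF
        System.out.println(fitsInSingleTerm("short value")); // true
        System.out.println(fitsInSingleTerm(huge));          // false -> would trigger the error
    }
}
```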
Writing a custom analyzer could definitely solve the issue, but having the default analyzer handle it was enough for my use case. By setting a flag (sort = false) in our own code, I was able to turn the default analyzer back on for the strings I send in.
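The sort = false flag is specific to our own wrapper code, but the underlying idea at the ElasticSearch level is to map the field as analyzed text rather than as one not-analyzed term. A rough sketch using the low-level Java REST client (the index name "documents" and field name "content" are made up, and the mapping syntax shown is for recent Elasticsearch versions; older versions used "index": "not_analyzed" on string fields instead):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class CreateAnalyzedMapping {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("PUT", "/documents"); // hypothetical index name
            // "text" fields go through an analyzer and are split into many small terms,
            // so no single term hits the 32766-byte limit. A "keyword" (not analyzed)
            // field would store the whole value as one term and reproduce the error.
            request.setJsonEntity(
                "{\"mappings\": {\"properties\": {" +
                "  \"content\": {\"type\": \"text\"}" +
                "}}}");
            client.performRequest(request);
        }
    }
}
```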
Other experiences
You will run into faulty PDFs. A lot. These cause issues with Apache Tika, such as "Zip bomb" errors, which are often triggered by deeply nested XML inside the PDF.
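Whatever the cause, you don't want one broken PDF to abort the whole run, so it helps to catch Tika's exceptions per document and move on. A minimal sketch using the Tika facade (the surrounding file handling is just an example):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class SafeExtract {
    private static final Tika TIKA = new Tika();

    /** Extracts text, or returns null when the document is broken (corrupt PDFs, zip bombs, ...). */
    static String extractText(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            return TIKA.parseToString(in);
        } catch (TikaException | IOException e) {
            // Log and skip: faulty PDFs are common enough that they must not stop the crawl.
            System.err.println("Skipping " + file + ": " + e.getMessage());
            return null;
        }
    }
}
```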
Also, don't underestimate the number of PDFs created using OCR. Although the PDF might look fine when you open it, the extracted text can be complete nonsense. A quick way to check is to copy the text from the PDF into Notepad and see whether it is correct.
Prepare enough RAM for this. A single document could sometimes ramp up the RAM usage of the program by 1-2 GB. How much of that was actually in use, rather than just waiting to be GC'd, I don't know.
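I never dug into this further, but one knob worth knowing about (an assumption on my part, not something I actually tuned back then) is Tika's write limit, which caps how much extracted text gets buffered per document:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class CappedExtract {
    static String extractCapped(Path file) throws Exception {
        // Buffer at most ~10 million characters of extracted text per document;
        // Tika throws a SAXException once the limit is reached instead of growing without bound.
        BodyContentHandler handler = new BodyContentHandler(10_000_000);
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, new Metadata());
        }
        return handler.toString();
    }
}
```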
Choose which files you actually want to parse. For example, there might be no useful reason to parse XML files.
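As a sketch of what that filtering might look like (the extension list and the root directory here are just examples, not what we actually used):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.stream.Stream;

public class FileFilter {
    // Only parse the formats that are actually worth indexing (example set).
    private static final Set<String> WANTED = Set.of("pdf", "doc", "docx", "txt");

    static boolean shouldParse(Path file) {
        String name = file.getFileName().toString().toLowerCase();
        int dot = name.lastIndexOf('.');
        return dot >= 0 && WANTED.contains(name.substring(dot + 1));
    }

    public static void main(String[] args) throws Exception {
        try (Stream<Path> paths = Files.walk(Path.of("/data/documents"))) { // hypothetical root
            paths.filter(Files::isRegularFile)
                 .filter(FileFilter::shouldParse)
                 .forEach(p -> System.out.println("would parse " + p));
        }
    }
}
```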
Scanning a large number of documents takes a long time. It might be best to split the process into indexing and updating. That way, you can limit the number of documents scanned per day by checking whether a document has already been indexed: if it has not, index it; if it has changed, update it (a sketch of that decision is below). In our case it took about 4 hours to scan ~80000 documents, on a single CPU with about 2 GB of RAM.
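A rough sketch of that index-or-update decision; the helpers findIndexedTimestamp, indexDocument, and updateDocument are hypothetical stand-ins for whatever your indexing layer provides:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.Optional;

public class IncrementalIndexer {

    void process(Path file) throws Exception {
        Instant lastModified = Files.getLastModifiedTime(file).toInstant();
        Optional<Instant> indexedAt = findIndexedTimestamp(file); // hypothetical lookup in the index

        if (indexedAt.isEmpty()) {
            indexDocument(file);                      // never seen before: index it
        } else if (lastModified.isAfter(indexedAt.get())) {
            updateDocument(file);                     // changed since last run: re-index it
        }                                             // otherwise: skip, already up to date
    }

    // Stand-ins: replace with calls into your own ElasticSearch client code.
    Optional<Instant> findIndexedTimestamp(Path file) { return Optional.empty(); }
    void indexDocument(Path file) {}
    void updateDocument(Path file) {}
}
```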
Hope this helped even a little.