I am trying to extract the text content of PDFs using the Apache Tika integration in Apache ManifoldCF, in order to ingest some PDF files from my laptop into an Elasticsearch server.
After creating the Tika transformer and configuring it inside my job, I see that the resulting "_content" field in Elasticsearch is filled with the binary encoding of the file rather than the extracted text.
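To make the symptom easier to verify, here is a minimal sketch of a check on a "_content" value copied out of an indexed document. It assumes the field is stored as a base64 string, which is only a guess on my part; the helper name `looks_like_raw_pdf` is hypothetical and not part of ManifoldCF or Elasticsearch. It relies only on the fact that PDF files start with the bytes `%PDF-`.

```python
import base64


def looks_like_raw_pdf(content: str) -> bool:
    """Return True if `content` is base64 that decodes to bytes starting with the PDF header.

    Assumption: the "_content" value is a base64 string. Extracted plain text
    normally contains spaces or punctuation and fails strict base64 decoding.
    """
    try:
        # 16 base64 characters decode to 12 bytes, enough to see the header.
        head = base64.b64decode(content[:16], validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return head.startswith(b"%PDF-")


if __name__ == "__main__":
    # Simulated field values; in practice, paste the "_content" value from the index.
    raw = base64.b64encode(b"%PDF-1.4\n1 0 obj").decode("ascii")
    print(looks_like_raw_pdf(raw))                     # True
    print(looks_like_raw_pdf("Some extracted text."))  # False
```

If this returns True for the stored value, the original file bytes are reaching the index unchanged instead of the text produced by Tika.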
I also saw this question: Extract file content with ManifoldCF, but it has received no answer since 2015.
Can anybody help me?
Thanks!