Apache ManifoldCF TIKA

Question

I am trying to extract the text content of PDF files using the Apache Tika integration in Apache ManifoldCF, in order to ingest some PDF files from my laptop into an Elasticsearch server.

After creating the Tika transformer and configuring it inside my job, I see that the resulting "_content" field in ES is filled with the binary encoding of the file, not the extracted text.

I also found this question: "Extract file content with ManifoldCF", but it still has no answer (since 2015!).

Can anybody help me?

Thanks!

score 0 · Answer 1 · answered Jul 22 '18 at 20:23

In the Elasticsearch output connector, what field name have you specified for the content field?

Please provide a field name as well as a maximum document size.
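To confirm what actually ended up in the "_content" field, a quick check like the one below may help. This is only a sketch: it assumes the "binary encoding" is either the raw PDF bytes or their Base64 form, which the question does not state, and the function name `looks_like_encoded_pdf` is made up for illustration. It relies on the fact that PDF files begin with the marker `%PDF-`.

```python
import base64
import binascii

# Every PDF file starts with this marker.
PDF_MAGIC = b"%PDF-"


def looks_like_encoded_pdf(value: str) -> bool:
    """Return True if value looks like raw or Base64-encoded PDF bytes.

    Hypothetical helper: pass it the string copied from the "_content"
    field of an indexed document.
    """
    text = value.strip()
    # Case 1: the raw file was stored as-is.
    if text.startswith("%PDF-"):
        return True
    # Case 2 (assumption): the file was stored as Base64.
    # Decode only the first 16 characters (a multiple of 4).
    head = text[:16]
    try:
        decoded = base64.b64decode(head, validate=True)
    except (binascii.Error, ValueError):
        # Not valid Base64, so most likely ordinary extracted text.
        return False
    return decoded.startswith(PDF_MAGIC)
```

If this returns True for the stored value, the file content reached Elasticsearch without being converted to text, which would point at the job's pipeline configuration rather than at Elasticsearch itself.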

Shashank Raj
