Apache Tika Output Format

Asked Jan 02 '17 at 13:46

Active Jan 02 '17 at 13:46

Viewed 340 times

I have a requirement where PDF files come in as input. I have to read each file and, based on some rules, split it page by page. The rules will be derived from data extracted from the given PDF.

I went through the Apache Tika toolkit, which I believe is built for this kind of requirement. The tool does extract the data, but only as text. I want the output back in PDF format. I am not sure whether that is possible or not. Please suggest.

Thanks. Manish.

Manish

You can do this with PDFBox alone unless you're using some unique Tika features. See https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox (read both answers) and https://stackoverflow.com/questions/32259167/how-to-split-a-pdf-using-apache-pdfbox – Tilman Hausherr Jan 02 '17 at 15:03
Thanks for the response. The reason I am using Tika is its OCR feature, which I believe PDFBox doesn't have. – Manish Jan 03 '17 at 05:57
In that case, the answer is in https://stackoverflow.com/questions/32259167/how-to-split-a-pdf-using-apache-pdfbox only. Tika itself doesn't split, but PDFBox (which is a part of Tika) does. – Tilman Hausherr Jan 03 '17 at 09:37
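
To make the "extract text, then split by rules" idea from the comments concrete, here is a minimal sketch of the rule step only. It assumes you already have one text string per page (for example from PDFBox's PDFTextStripper with setStartPage/setEndPage, or from Tika's OCR output) and it returns 1-based page ranges that could then be handed to PDFBox for the actual split. The class name PageGrouper, the PageRange type and the "Invoice No:" rule are all hypothetical, made up for illustration. The real rules would come from your own data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decides which consecutive pages belong in the same output file.
final class PageGrouper {

    /** Inclusive, 1-based page range whose pages share one key. */
    static final class PageRange {
        final String key;
        final int startPage;
        final int endPage;

        PageRange(String key, int startPage, int endPage) {
            this.key = key;
            this.startPage = startPage;
            this.endPage = endPage;
        }
    }

    // Assumed rule: a page carrying "Invoice No: <digits>" identifies its group.
    private static final Pattern KEY = Pattern.compile("Invoice No:\\s*(\\d+)");

    /** Returns the key found in the page text, or null if the page has none. */
    static String extractKey(String pageText) {
        Matcher m = KEY.matcher(pageText);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Groups consecutive pages by key. A page without a key is treated as a
     * continuation of the current group; a new range starts only when a page
     * carries a key that differs from the current one.
     */
    static List<PageRange> group(List<String> pageTexts) {
        List<PageRange> ranges = new ArrayList<>();
        String currentKey = null;
        int start = 1;
        for (int i = 0; i < pageTexts.size(); i++) {
            String key = extractKey(pageTexts.get(i));
            boolean startsNewGroup =
                    key != null && currentKey != null && !key.equals(currentKey);
            if (startsNewGroup) {
                // i is the 1-based number of the previous page, which closes the range.
                ranges.add(new PageRange(currentKey, start, i));
                start = i + 1;
            }
            if (key != null) {
                currentKey = key;
            }
        }
        if (!pageTexts.isEmpty()) {
            // Close the last open range (its key is null if no page matched the rule).
            ranges.add(new PageRange(currentKey, start, pageTexts.size()));
        }
        return ranges;
    }
}
```

Each resulting range maps onto a split: as far as I know, PDFBox 2.x's Splitter accepts setStartPage, setEndPage and setSplitAtPage, so one Splitter run per range (or copying the pages into a new PDDocument) would write the pieces back out as PDF rather than text.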

0 Answers