I am running OCR on a PDF file using Apache Tika Server.
I am interested in the hOCR output, but so far I only manage to get plain text.
Following the wiki and the code, I am trying to configure Tesseract using the X-Tika-OCR... HTTP headers. In this case, I am sending the X-Tika-OCRoutputType: hocr HTTP header, but I get either plain text or HTML output without hOCR tags.
I tried both the /tika and /rmeta endpoints.
The curl commands I use:
curl -v -X PUT --data-binary @file.pdf \
"http://tika-server:8081/tika" \
-H "Content-Type: application/pdf" \
-H "X-Tika-OCRoutputType: hocr"

curl -v -X PUT --data-binary @file.pdf \
"http://tika-server:8081/rmeta" \
-H "Content-Type: application/pdf" \
-H "X-Tika-OCRoutputType: hocr"
I also tried setting the Accept header to text/plain, text/html, text/xhtml and text/hocr. None of them works, and the last one returns an error.
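To check whether a response actually carries hOCR markup, I look for the class names that hOCR defines (ocr_page, ocr_line, ocrx_word). Below is a minimal sketch of that check; the helper name has_hocr_markup is my own and is not part of Tika or Tesseract, and the commented curl call simply reuses the host and file from the commands above.

```shell
# Hypothetical helper (not part of Tika or Tesseract): reads a response body
# on stdin and succeeds if it contains any of the standard hOCR class names.
has_hocr_markup() {
  grep -Eq 'ocr_page|ocr_line|ocrx_word'
}

# Example use against the server from the question (left commented out
# because it needs a running Tika server and a local file.pdf):
# curl -s -X PUT --data-binary @file.pdf \
#   "http://tika-server:8081/tika" \
#   -H "Content-Type: application/pdf" \
#   -H "X-Tika-OCRoutputType: hocr" | has_hocr_markup \
#   && echo "hOCR markup found" || echo "no hOCR markup"
```

With the requests shown above, this check reports no hOCR markup for both endpoints.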
I am using:
- Apache Tika 1.22
- Tesseract 4.1.0-3.1.x86_64
- RedHat 7