How do I force tika server to exclude the TesseractOCRParser using curl

Question

I'm running tika-server-1.23.jar with tesseract and extracting text from files using curl via php. Sometimes it takes too long to run with OCR so I'd like, occasionally, to exclude running tesseract. I can do this by inserting

<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>

in the tika config xml file but this means it never runs tesseract.

Can I force the tika server to skip using tesseract selectively at each request via curl and, if so, how?

I've got a workaround where I'm running two instances of the tika server each with a different config file listening on different ports but this is sub-optimal.

Thanks in advance.

Dave Meikle · Accepted Answer · 2020-12-04T22:21:42.387

You can set the OCR strategy using headers for PDF files, which includes an option not to OCR:

curl -T test.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: no_ocr"
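If you want to toggle this per request from a script, a minimal sketch might look like the following. The function names `pdf_ocr_header` and `extract_pdf`, the `skip` keyword and the `TIKA_URL` variable are made up for illustration; only the header itself and the endpoint come from the command above.

```shell
#!/bin/sh
# Assumed endpoint; override by exporting TIKA_URL before sourcing this file.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"

# Hypothetical helper: print the extra header when OCR should be skipped,
# and print nothing when the server's normal behaviour is wanted.
pdf_ocr_header() {
  if [ "$1" = "skip" ]; then
    printf '%s' 'X-Tika-PDFOcrStrategy: no_ocr'
  fi
}

# Hypothetical helper: upload one PDF, adding the header only when asked to.
#   $1 = path to the PDF, $2 = "skip" to disable OCR for this request
extract_pdf() {
  header=$(pdf_ocr_header "$2")
  if [ -n "$header" ]; then
    curl -sS -T "$1" "$TIKA_URL" --header "$header"
  else
    curl -sS -T "$1" "$TIKA_URL"
  fi
}
```

Calling `extract_pdf test.pdf skip` would then send the request without OCR, while `extract_pdf test.pdf` leaves the server configuration in charge.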

There isn't really an equivalent for other file types, but there is a similar header prefix called X-Tika-OCR that lets you set options on the TesseractOCRConfig instance for any file type.

You have some options which could be of interest in your scenario:

maxFileSizeToOcr - which you could set to 0
timeout - which you could set to the timeout you are willing to give
tesseractPath - which you can set to a path that doesn't exist; if Tika can't find tesseract, it can't execute it

So, for example, if you want to skip OCR for a file you could set the max file size to 0, which means tesseract will not process it:

curl -T testOCR.jpg http://localhost:9998/tika --header "X-Tika-OCRmaxFileSizeToOcr: 0"

Or set the path to /dummy:

curl -T testOCR.jpg http://localhost:9998/tika --header "X-Tika-OCRtesseractPath: /dummy"

You can of course also use these headers with PDF files too, should you wish.
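As a rough sketch of choosing between these options per request, something like the following could work. The helper names `ocr_header` and `extract_file` and the `skip` / `timeout=N` keywords are invented for this example. The `X-Tika-OCRtimeout` header name is simply the X-Tika-OCR prefix plus the `timeout` option listed above, and I'm assuming its value is in seconds, so check TesseractOCRConfig for your Tika version.

```shell
#!/bin/sh
# Assumed endpoint; override by exporting TIKA_URL before sourcing this file.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"

# Hypothetical helper: map a per-request mode to one X-Tika-OCR* header.
#   skip       -> maxFileSizeToOcr of 0, so nothing is handed to tesseract
#   timeout=N  -> cap the tesseract run at N (assumed to be seconds)
#   otherwise  -> no header, so the server defaults apply
ocr_header() {
  case "$1" in
    skip)      printf '%s' 'X-Tika-OCRmaxFileSizeToOcr: 0' ;;
    timeout=*) printf '%s' "X-Tika-OCRtimeout: ${1#timeout=}" ;;
    *)         : ;;
  esac
}

# Hypothetical helper: upload any file, adding the header only when one is set.
#   $1 = path to the file, $2 = mode as described above
extract_file() {
  header=$(ocr_header "$2")
  if [ -n "$header" ]; then
    curl -sS -T "$1" "$TIKA_URL" --header "$header"
  else
    curl -sS -T "$1" "$TIKA_URL"
  fi
}
```

For example, `extract_file testOCR.jpg skip` would bypass tesseract for that one request, and `extract_file testOCR.jpg timeout=30` would let it run with a limit instead.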

On the back of your comment I've also found this which has some more information on what can be included in the headers and the settings. Thanks again. https://stackoverflow.com/questions/62011038/apache-tika-server-request-header-parameters — Adam, Dec 07 '20 at 16:59
