I'm running tika-server-1.23.jar with Tesseract and extracting text from files using curl via PHP. OCR sometimes takes too long, so I'd like to skip Tesseract for some requests. I can disable it by inserting

<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>

in the Tika config XML file, but then Tesseract never runs.

Can I tell the Tika server to skip Tesseract on a per-request basis via curl and, if so, how?

I've got a workaround where I'm running two instances of the tika server each with a different config file listening on different ports but this is sub-optimal.

Thanks in advance.

Adam

1 Answer

You can set the OCR strategy using headers for PDF files, which includes an option not to OCR:

curl -T test.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: no_ocr"

There isn't really an equivalent for other file types, but there is a similar header prefix called X-Tika-OCR that lets you set properties on the TesseractOCRConfig instance for any file type.

You have some options which could be of interest in your scenario:

  • maxFileSizeToOcr - which you could set to 0
  • timeout - which you could set to the longest time you are willing to wait
  • tesseractPath - which you can set to a path that doesn't exist; if Tika can't find Tesseract, it can't execute it

So, for example, if you want to skip OCR for a file, you could set the max file size to 0, which means Tesseract will not process it:

curl -T testOCR.jpg http://localhost:9998/tika  --header "X-Tika-OCRmaxFileSizeToOcr: 0"

Or set the path to /dummy:

curl -T testOCR.jpg http://localhost:9998/tika  --header "X-Tika-OCRtesseractPath: /dummy"

You can of course also use these headers with PDF files too, should you wish.
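
To tie this back to the per-request toggle asked about in the question, here is a minimal shell sketch of a wrapper that adds the header only when OCR should be skipped. The function name tika_extract and the TIKA_URL and CURL variables are hypothetical names made up for this example; the only Tika-specific piece is the X-Tika-OCRmaxFileSizeToOcr header shown above.

```shell
#!/bin/sh
# Assumed defaults for this sketch; override them in the environment if needed.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"
CURL="${CURL:-curl}"

# tika_extract FILE [ocr|no_ocr]
# Hypothetical helper: sends FILE to the Tika server and, in no_ocr mode,
# adds the header that sets the Tesseract max file size to 0.
tika_extract() {
    file="$1"
    mode="${2:-ocr}"

    # Base curl arguments: quiet mode, upload the file to the server.
    set -- -s -T "$file" "$TIKA_URL"

    case "$mode" in
        no_ocr) set -- "$@" --header "X-Tika-OCRmaxFileSizeToOcr: 0" ;;
        ocr)    ;;  # no extra header: the server's configured OCR applies
        *)      echo "unknown mode: $mode" >&2; return 2 ;;
    esac

    "$CURL" "$@"
}
```

With a single server instance running, `tika_extract testOCR.jpg no_ocr` would skip Tesseract for that request, while `tika_extract testOCR.jpg` leaves OCR enabled. The same conditional header could be set from PHP's curl bindings instead of a shell wrapper.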

Dave Meikle
  • The X-Tika-OCRmaxFileSizeToOcr: 0 works for me. Thanks. – Adam Dec 07 '20 at 14:42
  • On the back of your comment I've also found this which has some more information on what can be included in the headers and the settings. Thanks again. https://stackoverflow.com/questions/62011038/apache-tika-server-request-header-parameters – Adam Dec 07 '20 at 16:59