I'm trying to extract text from a large PDF using the code below, and it always hits the timeout. The file comes from a blob on Azure; it is 7.3 MB and has 140 pages, all of which are images.

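import os

# Set the endpoint before importing tika so the client picks it up.
os.environ['TIKA_SERVER_ENDPOINT'] = 'http://0.0.0.0:9998/'

from tika import parser

headers = {
    "X-Tika-OCRLanguage": "eng+nor",
    "X-Tika-PDFextractInlineImages": "true",  # run OCR against inline images
}

# buffer: the download of the Azure blob (created elsewhere)
data = parser.from_buffer(
    buffer.readall(),
    xmlContent=True,
    requestOptions={
        "headers": headers,
        "timeout": 3600,  # client-side timeout in seconds
    },
)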

Is there a header I'm missing for handling large files?

I'm running tika-server directly from the Docker image with this command:

docker run -d -p 9998:9998 apache/tika:1.28.2-full

Thanks for your time!

Tau n Ro

2 Answers

I think I've managed to solve the problem. I only needed to change the headers: I dropped "X-Tika-PDFextractInlineImages" and set the OCR strategy to "auto". For the moment it's working:

headers = {
    "X-Tika-OCRLanguage": "eng+nor",
    "X-Tika-PDFocrStrategy": "auto"
}
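
For anyone who wants the full call with these headers, here is a minimal sketch. It passes the headers through requestOptions the same way the question does. The helper names build_request_options and extract_xhtml are hypothetical (they are not part of tika-python), and the sketch assumes TIKA_SERVER_ENDPOINT is already set as in the question.

```python
def build_request_options(languages, ocr_strategy="auto", timeout_s=3600):
    """Build the requestOptions dict used by the question's from_buffer call.

    Hypothetical helper: only the header names and values come from the thread.
    """
    headers = {
        # Tesseract language codes joined with "+", e.g. "eng+nor"
        "X-Tika-OCRLanguage": "+".join(languages),
        # "auto" is the value that worked in this answer
        "X-Tika-PDFocrStrategy": ocr_strategy,
    }
    return {"headers": headers, "timeout": timeout_s}


def extract_xhtml(pdf_bytes, request_options):
    """Send the PDF bytes to the Tika server and return the parsed result."""
    # Imported here so the options can be built and inspected
    # even on a machine without tika-python installed.
    from tika import parser

    return parser.from_buffer(
        pdf_bytes,
        xmlContent=True,
        requestOptions=request_options,
    )


options = build_request_options(["eng", "nor"])
```

Usage would then be something like data = extract_xhtml(buffer.readall(), options), where buffer is the blob download from the question.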
Tau n Ro

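If it always reaches the timeout, try increasing the server-side task timeout with a tika-config.xml file. The value is in milliseconds, so 600000 is 10 minutes. Mount the file into the container and pass it with --config.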

tika-config.xml

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <server>
        <params>
            <taskTimeoutMillis>600000</taskTimeoutMillis>
        </params>
    </server>
</properties>

$ docker run -d -p 9998:9998 \
             -v $PWD/tika-config.xml:/tika-config.xml \
             apache/tika:2.8.0.0-full \
             --config /tika-config.xml
Lakindu