How to deal with a large PDF?

Question

I'm trying to extract text from a large PDF with the code below, and it always hits the timeout. The file comes from a blob on Azure; it is 7.3 MB and has 140 pages, all of which are images.

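import os

# Set the endpoint before importing tika, which reads it at import time
os.environ['TIKA_SERVER_ENDPOINT'] = 'http://0.0.0.0:9998/'

from tika import parser

headers = {
    "X-Tika-OCRLanguage": "eng+nor",
    "X-Tika-PDFextractInlineImages": "true",  # run OCR against inline images
}

# buffer is the Azure blob download stream (defined elsewhere)
data = parser.from_buffer(
    buffer.readall(),
    xmlContent=True,
    requestOptions={
        "headers": headers,
        "timeout": 3600,  # client-side timeout, in seconds
    },
)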

Is there a header I'm missing for handling large files?

I'm running tika-server directly from the Docker image with this command:

docker run -d -p 9998:9998 apache/tika:1.28.2-full

Thanks for your time!

score 0 · Accepted Answer · answered May 25 '22 at 16:02


I think I've managed to solve the problem. I only needed to change the headers: I dropped X-Tika-PDFextractInlineImages and set X-Tika-PDFocrStrategy to auto. It's working for the moment:

headers = {
    "X-Tika-OCRLanguage": "eng+nor",
    "X-Tika-PDFocrStrategy": "auto"
}
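
A minimal sketch of how these headers might be plugged into the call from the question. The helper names build_request_options and extract_text are made up for illustration and are not part of tika-python; the endpoint and the 3600-second timeout are simply carried over from the question.

```python
import os

# Endpoint taken from the question; set it before tika is imported.
os.environ.setdefault("TIKA_SERVER_ENDPOINT", "http://0.0.0.0:9998/")


def build_request_options(languages=("eng", "nor"), timeout=3600):
    """Hypothetical helper: assemble the requestOptions dict for tika-python."""
    headers = {
        # Tesseract language codes joined with "+", as in the question
        "X-Tika-OCRLanguage": "+".join(languages),
        # The OCR strategy header from this answer
        "X-Tika-PDFocrStrategy": "auto",
    }
    return {"headers": headers, "timeout": timeout}


def extract_text(pdf_bytes):
    """Hypothetical helper: send the PDF bytes to the Tika server."""
    # Imported lazily so build_request_options works without tika installed
    from tika import parser

    return parser.from_buffer(
        pdf_bytes,
        xmlContent=True,
        requestOptions=build_request_options(),
    )
```

With the server running, extract_text(buffer.readall()) should return tika-python's parsed result, a dict whose "content" entry holds the extracted XHTML.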

Tau n Ro


Lakindu · Answer 2 · 2023-08-24T12:45:42.193

If it always hits the timeout, try raising the server-side task timeout with a tika-config.xml file. The value below is in milliseconds (600000 ms = 10 minutes).

tika-config.xml

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <server>
        <params>
            <taskTimeoutMillis>600000</taskTimeoutMillis>
        </params>
    </server>
</properties>

$ docker run -d -p 9998:9998 \
    -v "$PWD/tika-config.xml":/tika-config.xml \
    apache/tika:2.8.0.0-full \
    --config /tika-config.xml
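
If the client is tika-python as in the question, its own requestOptions timeout (in seconds) probably needs to be at least as long as the server's, otherwise the client gives up first. Here is a rough sketch that derives it from the config above; client_timeout_seconds and the 30-second margin are illustrative assumptions, not anything Tika defines.

```python
import xml.etree.ElementTree as ET

# Same content as the tika-config.xml above, inlined to keep the sketch self-contained
CONFIG_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <server>
        <params>
            <taskTimeoutMillis>600000</taskTimeoutMillis>
        </params>
    </server>
</properties>
"""


def client_timeout_seconds(config_xml, margin_seconds=30):
    """Hypothetical helper: server task timeout in seconds plus a safety margin.

    Returns None when the config does not set taskTimeoutMillis.
    """
    root = ET.fromstring(config_xml)
    node = root.find("./server/params/taskTimeoutMillis")
    if node is None or not (node.text or "").strip():
        return None
    return int(node.text) // 1000 + margin_seconds


# Pass this as the "timeout" entry of requestOptions on the client side
timeout = client_timeout_seconds(CONFIG_XML)
```

For the config above this gives 600 seconds from the server setting plus the assumed 30-second margin.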
