Questions tagged [tika-server]
90 questions
10
votes
0 answers
How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?
I am attempting to parse tens of millions of office documents with Tika: PDFs, Word documents, Excel spreadsheets, XML files, etc., a wide assortment of types.
Throughput is very important. I need to be able to parse these files in a reasonable amount of time, but at the same time,…

Nicholas DiPiazza
- 10,029
- 11
- 83
- 152
6
votes
1 answer
Apache Tika Server - Request Header Parameters?
The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters like X-Tika-PDFOcrStrategy, e.g.:
$ curl -T test/Dokument01.pdf http://localhost:9998/tika --header…

Ralph
- 4,500
- 9
- 48
- 87
6
votes
1 answer
Warning message from tika python module using the unpack method
I'm currently using tika to extract the text from pdf files. I found a very fast method within the tika module. This method is called unpack.
This is my code:
from tika import unpack
text = unpack.from_file('example.pdf')['content']
However, once…

teller.py3
- 822
- 8
- 22
4
votes
0 answers
Tika python does not preserve the order of texts in pdf
I am using tika-python to extract text from a PDF. But when there are multiple tables on a PDF page, the order of the text is not preserved. In my case, the table at the top of the page comes at the end when extracted through Tika.
I tried using…

ggaurav
- 1,764
- 1
- 10
- 10
4
votes
3 answers
AttributeError: 'bytes' object has no attribute 'close' when Tika parser is run
I'm trying to run a simple parse line of code using Tika to parse text from a PDF (named outputFileName in this example). This used to run without errors. I recently had my laptop sent in to our work IT for software updates and had to reinstall…

dweir247
- 63
- 4
4
votes
1 answer
Is there a way to turn off parsing of embedded docs in the tika-server?
I run an unmodified JAX-RS instance of the Apache tika-server 1.22 and use it as an HTTP end-point service that I post files to (mostly Office, PDF and RTF) and get plain-text renditions back with HTTP requests (using the Accept="text/plain" header)…

henrythewasp
- 43
- 4
4
votes
1 answer
Python Tika cannot parse pdf from url
python for parsing the online pdf for future usage. My code is below.
from tika import parser
import requests
import io
url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
response = requests.get(url)
with…

Platalea Minor
- 877
- 2
- 9
- 22
4
votes
2 answers
Python - Apache Tika Single Page parser
I was wondering if there is any way using Tika/Python to only parse the first page or extract the metadata from the first page only? Right now, when I pass the pdf, it is parsing every single page.
I looked at this link: Is it possible to extract…

sharp
- 2,140
- 9
- 43
- 80
4
votes
0 answers
How to limit amount of extracted text with Tika server?
In my scenario, I have some large PDF files and would like to limit the amount of text extracted and returned by tika-server. I know it's possible using the Java library directly. However, how can I do this when making HTTP requests to tika-server /tika…

Eugene Shvets
- 4,561
- 13
- 19
3
votes
3 answers
How do you enable the TesseractOCRParser using TikaConfig and the Tika command line utility?
I have installed Apache Tika 1.8 and it is running perfectly, except that the OCR part is not working. I have Tesseract installed and it is also working properly.
When I try to send a pdf with an image on it I get the following.
WARNING: Tesseract OCR is…

Dunski
- 653
- 5
- 14
3
votes
1 answer
JNIUS & TIKA - error trying to parseToString
I tried to run the tika-app with jnius but ran into a problem (macOS Sierra, Java 1.8 JDK, Python 2.7 & Python 3.6).
Everything works fine (output for tika.detect is fine) until the parseToString command. It seems a pop-up appears if you run…

Julian Decker
- 41
- 4
2
votes
1 answer
Increase OCR timeout in TIKA
In the newest Tika (2.5), the default OCR timeout is 300, which is not enough when multiple documents or images are OCR-processed in parallel. This leads to Tika OCR timeouts and therefore a Tika exception for the whole document.
I've tried adding the X-Tika-Timeout-Millis header but it…

Kate
- 33
- 2
2
votes
0 answers
Tika child processes keep dying
Tika child processes keep dying. I tried to increase the heap size to 2GB but that doesn't seem to affect anything, after ~100 files the child process just dies and the Tika server restarts it. I have 8GB RAM/4 CPUs assigned to it, and this is my…
user17365408
2
votes
1 answer
Python - Tika Parser - Content Not Loading
I have a few PDFs that I was able to parse until a few days ago using tika.
I have not changed anything from my code, but am no longer able to view the content in those same PDFs by running the below code:
from tika import parser
raw =…

santorch
- 151
- 1
- 14
2
votes
1 answer
getting hocr output from tika-server
I am running OCR on a PDF file using Apache Tika Server.
I am interested in the hOCR output, but have only managed to get the output in plain text format.
Following the wiki and the code, I am trying to configure Tesseract using X-Tika-OCR... HTTP headers.…

Amnon
- 2,212
- 1
- 19
- 35