Questions tagged [tika-server]

90 questions
10 votes, 0 answers

How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

I am attempting to parse tens of millions of office documents with Tika: PDFs, Word documents, Excel spreadsheets, XML files, and so on, a wide assortment of types. Throughput is very important. I need to be able to parse these files in a reasonable amount of time, but at the same time,…
Nicholas DiPiazza
6 votes, 1 answer

Apache Tika Server - Request Header Parameters?

The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters such as X-Tika-PDFOcrStrategy. For example: $ curl -T test/Dokument01.pdf http://localhost:9998/tika --header…
Ralph
6 votes, 1 answer

Warning message from tika python module using the unpack method

I'm currently using Tika to extract the text from PDF files. I found a very fast method within the tika module, called unpack. This is my code: from tika import unpack; text = unpack.from_file('example.pdf')['content'] However, once…
teller.py3
4 votes, 0 answers

Tika python does not preserve the order of texts in pdf

I am using tika-python to extract text from a PDF. But when there are multiple tables on a PDF page, the order of the text is not preserved. In my case, the table at the top of the page comes out at the end when extracted through Tika. I tried using…
ggaurav
4 votes, 3 answers

AttributeError: 'bytes' object has no attribute 'close' when Tika parser is run

I'm trying to run a simple line of code that uses Tika to parse text from a PDF (named outputFileName in this example). This used to run without errors. I recently had my laptop sent in to our work IT for software updates and had to reinstall…
dweir247
4 votes, 1 answer

Is there a way to turn off parsing of embedded docs in the tika-server?

I run an unmodified JAX-RS instance of the Apache tika-server 1.22 and use it as an HTTP endpoint service: I post files to it (mostly Office, PDF and RTF) and get plain-text renditions back via HTTP requests (using the Accept="text/plain" header)…
4 votes, 1 answer

Python Tika cannot parse pdf from url

python for parsing the online PDF for future usage. My code is below: from tika import parser import requests import io url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf' response = requests.get(url) with…
Platalea Minor
4 votes, 2 answers

Python - Apache Tika Single Page parser

I was wondering if there is any way, using Tika/Python, to parse only the first page or extract the metadata from the first page only. Right now, when I pass the PDF, it parses every single page. I looked at this link: Is it possible to extract…
sharp
4 votes, 0 answers

How to limit amount of extracted text with Tika server?

In my scenario, I have some large PDF files and would like to limit the amount of text extracted and returned by the Tika server. I know this is possible using the Java library directly. However, how can I do this when making HTTP requests to tika-server /tika…
Eugene Shvets
3 votes, 3 answers

How do you enable the TesseractOCRParser using TikaConfig and the Tika command line utility?

I have installed Apache Tika 1.8 and it is running perfectly, except that the OCR part is not working. I have Tesseract installed and it is also working properly. When I try to send a PDF with an image in it, I get the following. WARNING: Tesseract OCR is…
Dunski
3 votes, 1 answer

JNIUS & TIKA - error trying to parseToString

I tried to run the tika-app with jnius but ran into a problem (macOS Sierra, Java 1.8 JDK, Python 2.7 & Python 3.6). Everything works fine (the output of tika.detect is fine) until the parseToString command. It seems a pop-up appears if you run…
2 votes, 1 answer

Increase OCR timeout in TIKA

In the newest Tika (2.5), the default OCR timeout is 300, which is not enough when multiple documents or images are being OCR'd in parallel. This leads to Tika OCR timeouts and therefore a Tika exception for the whole document. I've tried adding the X-Tika-Timeout-Millis header but it…
Kate
2 votes, 0 answers

Tika child processes keep dying

Tika child processes keep dying. I tried increasing the heap size to 2GB, but that doesn't seem to affect anything: after ~100 files the child process just dies and the Tika server restarts it. I have 8GB RAM/4 CPUs assigned to it, and this is my…
user17365408
2 votes, 1 answer

Python - Tika Parser - Content Not Loading

I have a few PDFs that I was able to parse with Tika until a few days ago. I have not changed anything in my code, but I am no longer able to view the content of those same PDFs when running the code below: from tika import parser raw =…
santorch
2 votes, 1 answer

getting hocr output from tika-server

I am running OCR on a PDF file using Apache Tika Server. I am interested in the hOCR output, but have only managed to get the output in plain-text format. Following the wiki and the code, I am trying to configure Tesseract using X-Tika-OCR... HTTP headers.…
Amnon