Highest Voted 'tika-server' Questions

10

votes

0 answers

How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

I am attempting to parse tens of millions of office documents with Tika: PDFs, DOCs, Excel files, XMLs, etc. A wide assortment of types. Throughput is very important. I need to be able to parse these files in a reasonable amount of time, but at the same time,…

asked Nov 22 '20 at 05:27

Nicholas DiPiazza

10,029
11
83
152

6

votes

1 answer

Apache Tika Server - Request Header Parameters?

The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters like X-Tika-PDFOcrStrategy, e.g.: $ curl -T test/Dokument01.pdf http://localhost:9998/tika --header…
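
For readers who want to try the same kind of request from Python rather than curl, here is a minimal sketch using only the standard library. The endpoint and the X-Tika-PDFOcrStrategy header name come from the excerpt above; the `ocr_only` value, the Accept and Content-Type headers, and the helper names `build_tika_request` and `extract_text` are assumptions for illustration, not something the question confirms.

```python
import urllib.request

# Endpoint taken from the excerpt above; adjust to your own server.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, ocr_strategy="ocr_only"):
    """Build (but do not send) a PUT request for the /tika endpoint.

    `ocr_strategy` is an assumed example value for X-Tika-PDFOcrStrategy.
    """
    headers = {
        "Accept": "text/plain",             # ask for plain text back
        "Content-Type": "application/pdf",  # assumed input type
        "X-Tika-PDFOcrStrategy": ocr_strategy,
    }
    return urllib.request.Request(
        TIKA_URL, data=pdf_bytes, headers=headers, method="PUT"
    )


def extract_text(pdf_bytes, ocr_strategy="ocr_only"):
    """Send the request; needs a tika-server listening at TIKA_URL."""
    request = build_tika_request(pdf_bytes, ocr_strategy)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it makes it easy to inspect which headers will go out before a server is involved.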

apache-tika tika-server

asked May 25 '20 at 21:26

Ralph

4,500
9
48
87

6

votes

1 answer

Warning message from tika python module using the unpack method

I'm currently using tika to extract the text from pdf files. I found a very fast method within the tika module. This method is called unpack. This is my code: from tika import unpack text = unpack.from_file('example.pdf')['content'] However, once…

python python-3.x apache-tika tika-server

asked Nov 02 '18 at 16:07

teller.py3

822
8
22

4

votes

0 answers

Tika python does not preserve the order of texts in pdf

I am using tika-python to extract text from PDFs. But when there are multiple tables on a PDF page, the order of the text is not preserved. In my case the table at the top of the page comes at the end when extracted through tika. I tried using…

python apache-tika tika-server

asked May 14 '20 at 11:12

ggaurav

1,764
1
10
10

4

votes

3 answers

AttributeError: 'bytes' object has no attribute 'close' when Tika parser is run

I'm trying to run a simple parse line of code using Tika to parse text from a PDF (named outputFileName in this example). This used to run without errors. I recently had my laptop sent in to our work IT for software updates and had to reinstall…

python parsing apache-tika pdf-parsing tika-server

asked Nov 11 '19 at 14:46

dweir247

63
4

4

votes

1 answer

Is there a way to turn off parsing of embedded docs in the tika-server?

I run an unmodified JAX-RS instance of the Apache tika-server 1.22 and use it as an HTTP end-point service that I post files to (mostly Office, PDF and RTF) and get plain-text renditions back with HTTP requests (using the Accept="text/plain" header)…

apache-tika tika-server

asked Oct 10 '19 at 08:29

henrythewasp

43
4

4

votes

1 answer

Python Tika cannot parse pdf from url

python for parsing the online pdf for future usage. My code is below. from tika import parser import requests import io url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf' response = requests.get(url) with…

python apache-tika tika-server

asked Nov 25 '18 at 16:28

Platalea Minor

877
2
9
22

4

votes

2 answers

Python - Apache Tika Single Page parser

I was wondering if there is any way using Tika/Python to only parse the first page or extract the metadata from the first page only? Right now, when I pass the PDF, it parses every single page. I looked at this link: Is it possible to extract…

python apache-tika tika-server

asked Nov 01 '18 at 00:05

sharp

2,140
9
43
80

4

votes

0 answers

How to limit amount of extracted text with Tika server?

In my scenario, I have some large PDF files and would like to limit the amount of text extracted and returned by tika-server. I know it's possible using the Java library directly. However, how can I do this when making HTTP requests to tika-server /tika…
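
One client-side workaround for the question above is to stop reading the HTTP response once enough text has arrived. The sketch below is an assumption about how that could look, not a tika-server feature: `read_limited` is a hypothetical helper, and it only caps what the client reads, so it may not stop the server from parsing the whole file.

```python
import codecs


def read_limited(stream, max_chars, chunk_size=8192, encoding="utf-8"):
    """Read at most max_chars characters of text from a binary stream.

    `stream` is any object with a read(n) method, such as the response
    returned by urllib.request.urlopen.
    """
    # An incremental decoder copes with multi-byte characters that are
    # split across chunk boundaries.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    pieces = []
    total = 0
    while total < max_chars:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            break
        text = decoder.decode(chunk)
        pieces.append(text)
        total += len(text)
    # The last chunk may overshoot the limit, so trim the result.
    return "".join(pieces)[:max_chars]
```

Closing the response after `read_limited` returns drops the rest of the body on the client side.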

apache-tika tika-server

asked Jan 05 '17 at 23:24

Eugene Shvets

4,561
13
19

3

votes

3 answers

How do you enable the TesseractOCRParser using TikaConfig and the Tika command line utility?

I have installed Apache Tika 1.8 and it is running perfectly, except the OCR part is not working. I have Tesseract installed and it is also working properly. When I try to send a PDF with an image in it I get the following. WARNING: Tesseract OCR is…

apache configuration ocr tesseract tika-server

asked Aug 02 '18 at 13:59

Dunski

653
5
14

3

votes

1 answer

JNIUS & TIKA - error trying to parseToString

I tried to run the tika-app with jnius but ran into a problem (macOS Sierra, Java 1.8 JDK, Python 2.7 & Python 3.6). Everything works fine (output for tika.detect is fine) until the parseToString command. It seems there's a pop-up showing up if you run…

java python apache-tika pyjnius tika-server

asked May 14 '17 at 09:55

Julian Decker

41
4

2

votes

1 answer

Increase OCR timeout in TIKA

In the newest Tika (2.5), the default OCR timeout is 300, which is not enough when multiple documents or images are OCRed in parallel. This leads to Tika OCR timeouts and so a Tika exception for the full document. I've tried adding the X-Tika-Timeout-Millis header but it…

tesseract apache-tika tika-server tika-python

asked Dec 01 '22 at 15:05

Kate

33
2

2

votes

0 answers

Tika child processes keep dying

Tika child processes keep dying. I tried to increase the heap size to 2GB but that doesn't seem to affect anything: after ~100 files the child process just dies and the Tika server restarts it. I have 8GB RAM/4 CPUs assigned to it, and this is my…

jvm apache-tika tika-server

asked Aug 02 '22 at 10:20

user17365408

2

votes

1 answer

Python - Tika Parser - Content Not Loading

I have a few PDFs that I was able to parse until a few days ago using tika. I have not changed anything in my code, but am no longer able to view the content in those same PDFs by running the below code: from tika import parser raw =…

python apache-tika tika-server

asked May 17 '20 at 02:18

santorch

151
1
14

2

votes

1 answer

getting hocr output from tika-server

I am running OCR on a PDF file using Apache Tika Server. I am interested in the hOCR output, but have only succeeded in getting the output in plain text format. Following the wiki and the code, I am trying to configure Tesseract using X-Tika-OCR... HTTP headers.…

tesseract apache-tika tika-server hocr

asked Jan 09 '20 at 10:40

Amnon

2,212
1
19
35
