Questions tagged [tika-python]
15 questions
3
votes
0 answers
How to read PDF/DOCX page by page using tika library in python?
`#!/usr/bin/env python
import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('frank_diary.docx')
print(parsed["metadata"])
print(parsed["content"])`
With this code I can read the whole file, but not page by page.
Ref. I…
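One possible direction, offered as a sketch and not a confirmed answer: for PDFs, Tika's XHTML output normally wraps each page in a `<div class="page">` element, and tika-python can return that XHTML through `parser.from_file(path, xmlContent=True)`. DOCX files generally carry no such page markers, so this is assumed to apply to PDFs only. `split_pages` and `PageSplitter` are hypothetical helper names, not part of tika-python.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collect the text inside each <div class="page"> element."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._depth = 0  # <div> nesting depth inside the current page; 0 = outside
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # nested <div> inside a page
        elif dict(attrs).get("class") == "page":
            self._depth = 1   # a new page starts here
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:  # the page <div> just closed
                self.pages.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def split_pages(xhtml):
    """Return a list with one text string per page found in the XHTML."""
    splitter = PageSplitter()
    splitter.feed(xhtml or "")
    splitter.close()
    return splitter.pages


# Assumed usage with tika-python (requires a running Tika server):
#   from tika import parser
#   xhtml = parser.from_file("sample.pdf", xmlContent=True)["content"]
#   for number, text in enumerate(split_pages(xhtml), start=1):
#       print(number, text[:80])
```

If the XHTML contains no page markers, the function returns an empty list, which is a quick way to check whether the format exposes pages at all.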

dilip kukadiya
- 31
- 2
2
votes
1 answer
Increase OCR timeout in TIKA
In the newest Tika (2.5), the default OCR timeout is 300, which is not enough when multiple documents or images are OCR-processed in parallel. This leads to Tika OCR timeouts and therefore a Tika exception for the whole document.
I've tried adding the X-Tika-Timeout-Millis header but it…

Kate
- 33
- 2
2
votes
0 answers
I have extracted a PDF file using Python Tika, but I want to extract the header and footer details. How can I get them?
import tika
from tika import parser
FileName = "sample.pdf"
PDF_Parse = parser.from_file(FileName)
print(PDF_Parse['content'])
print(PDF_Parse['metadata'])
But I want to extract the header and footer details. How can I do that using Python Tika?

jothi prabu
- 21
- 1
2
votes
1 answer
Increase tika heap size in Python with tika-python
Can someone suggest a way to give tika a larger heap size (1 GByte or so) while using tika-python (on Windows)?
I get "status: 500" errors from tika when processing very large Microsoft Word files. If I run tika from the Windows command line as…

nerdfever.com
- 1,652
- 1
- 20
- 41
0
votes
1 answer
can't parse IP address from PDF file, no error, just empty
I'm using Tika to parse IP addresses from a PDF file. Below is my code:
import tika
from tika import parser
import re
# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    tika.initVM()
    # opening pdf file
…
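The rest of the code is cut off above, so the following is only a sketch of one way to pull IPv4 addresses out of already-extracted text. It assumes the text comes from `parsed["content"]`, which tika-python returns as `None` when no text could be extracted (for example, from image-only pages), a common reason for an empty result with no error. `find_ipv4` is a hypothetical helper name.

```python
import ipaddress
import re

# Loose candidate pattern: four dot-separated groups of 1-3 digits that are
# not part of a longer dotted number such as "1.2.3.4.5".
_CANDIDATE = re.compile(
    r"(?<!\d)(?<!\d\.)\d{1,3}(?:\.\d{1,3}){3}(?!\d)(?!\.\d)"
)


def find_ipv4(text):
    """Return the valid IPv4 addresses found in text, in order of appearance."""
    found = []
    for candidate in _CANDIDATE.findall(text or ""):
        try:
            # Rejects out-of-range octets such as 999.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        found.append(candidate)
    return found
```

Passing `None` returns an empty list, so checking whether `parsed["content"]` is `None` before searching helps separate "no text extracted" from "no addresses in the text".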

Huy Than
- 1,538
- 2
- 16
- 31
0
votes
0 answers
How can I extract text from an image in a pdf using the python port of Apache/Tika 2.6.0?
import tika
from tika import parser
import pytesseract
from PIL import Image
import numpy
import scipy
from tika import config
tika.initVM()
headers={'X-Tika-OCRLanguage': 'eng','X-Tika-PDFextractInlineImages': 'true','X-Tika-PDFOcrStrategy':…

ScottyCov
- 21
- 5
0
votes
0 answers
Extract text from a folder with many pdfs with python pandas and jupyter
I have multiple directories containing many PDF documents.
I would like to convert them to plain text with Python, all in one file, so that I can search the text in the created text file, with a second column holding the reference link to…
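A minimal sketch of the "one file, two columns" idea, under stated assumptions: the output is a CSV with the extracted text in the first column and a `file://` link to the source PDF in the second (the question is cut off before it says what the link should point to). `build_index` and `tika_extract` are hypothetical names; the extractor is passed in as a parameter so the directory walk can be tried without a Tika server.

```python
import csv
from pathlib import Path


def tika_extract(pdf_path):
    """Extract text with tika-python (assumes it is installed and can reach a server)."""
    from tika import parser  # imported lazily so the rest runs without Tika
    return parser.from_file(str(pdf_path)).get("content") or ""


def build_index(root_dirs, out_csv, extract=tika_extract):
    """Write one CSV row per PDF found under root_dirs; return the row count."""
    rows = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "source"])
        for root in root_dirs:
            for pdf in sorted(Path(root).rglob("*.pdf")):
                # Collapse whitespace so each document stays on one CSV row.
                text = " ".join(extract(pdf).split())
                writer.writerow([text, pdf.resolve().as_uri()])
                rows += 1
    return rows
```

The resulting file can then be loaded with `pandas.read_csv` in a notebook and searched with ordinary string filters.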

scofx
- 149
- 12
0
votes
1 answer
Latest Tesseract in Tika
The newest available version of Tesseract is 5.x, but the latest Tika still uses 4.x.
Is it possible to upgrade the version of Tesseract OCR in Tika?

Kate
- 33
- 2
0
votes
0 answers
running tika-python in docker container offline
I have a web app that uses tika-python. It works fine, and each time I start it, it downloads two files, "tika-server.jar" and "tika-server.jar", to local storage and then parses files.
But sometimes it's unable to download those files, so this service doesn't work…

Garuda
- 46
- 7
0
votes
1 answer
How to get "Fast Web View" property value from pdf using python or any other source?
Is there a way to extract Fast Web View property value programmatically? Python would be preferred.
Thanks
Manohar

Manohar KM
- 11
- 1
0
votes
2 answers
How to deal with large pdf?
I'm trying to extract text from a large PDF using this code (my file comes from a blob on Azure; the PDF is 7.3 MB and has 140 pages, all of them images), and it always hits the timeout.
os.environ['TIKA_SERVER_ENDPOINT'] =…

Tau n Ro
- 108
- 8
0
votes
1 answer
Tika server returned 500 status code when processing a pdf file
Code:
dd = parser.from_file(r"file_path")
Line number 554 in tika.py:
resp = verbFn(serviceUrl, encodedData, **effectiveRequestOptions)
The reason in resp was INKApi Error.
I am running tika server on my system.

shobhna
- 13
- 6
0
votes
1 answer
Tika server fails to start in Airflow (from the fourth simultaneous run) deployed on Kubernetes
I wanted to ask if any of you have encountered a similar error.
I work at a company where we use Airflow, deployed on Azure Kubernetes.
We have a DAG in charge of extracting some information about different documents. Among many of the…

Tau n Ro
- 108
- 8
0
votes
2 answers
How to extract specific lines of text from multiple PDFs in a location and store them in Excel?
I have 100 PDFs stored in a location, and I want to extract text from them and store it in Excel.
Below is an image of the PDF.
From it I want (stored on page 1):
bid no, end date, item category, organisation name
needed
OEM Average Turnover (Last 3 Years),Years of…

Deepak Jain
- 137
- 1
- 3
- 27
-1
votes
1 answer
Find multiple text in pdfs
I'm currently trying to pull PDFs that contain the following list of text. I was able to pull PDFs, but only with one word. Should I change my condition below? Thanks in advance. Newbie here.
from tika import parser
import glob
path =…
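The condition itself is cut off above, so this is only a sketch of one way to test extracted text against several search terms at once. It assumes a case-insensitive substring match is what is wanted; `matching_terms` and `matches` are hypothetical helper names, and the text would come from `parser.from_file(...)["content"]`, which may be `None`.

```python
def matching_terms(text, terms):
    """Return the search terms that occur in text, ignoring case."""
    haystack = (text or "").casefold()
    return [term for term in terms if term.casefold() in haystack]


def matches(text, terms, require_all=False):
    """True if any term occurs in text, or every term when require_all is set."""
    hits = matching_terms(text, terms)
    if require_all:
        return len(hits) == len(terms)
    return bool(hits)
```

A loop over `glob.glob(...)` could then keep each file for which `matches(content, terms)` is true, instead of comparing against a single word.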

MFalcon
- 23
- 4