Questions tagged [tika-python]

15 questions
3
votes
0 answers

How to read PDF/DOCX page by page using tika library in python?

`# #!/usr/bin/env python import tika tika.initVM() from tika import parser parsed = parser.from_file('frank_diary.docx') print(parsed["metadata"]) print(parsed["content"])` With this code I am able to read the whole file, but not page by page. Ref. I…
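One approach that may help for PDFs, sketched here under assumptions rather than taken from the question: request XHTML output with `parser.from_file(path, xmlContent=True)` and split on the `<div class="page">` wrappers that Tika emits around each PDF page. DOCX files carry no fixed page boundaries, so this is not expected to work for them. The `PageSplitter` class, the `split_pages` helper, and the sample markup are hypothetical; the sample stands in for `parsed["content"]` so the snippet runs without a Tika server.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collect the text inside each <div class="page"> element."""

    def __init__(self):
        super().__init__()
        self.pages = []   # finished pages, in document order
        self._depth = 0   # <div> nesting depth inside the current page
        self._buf = []    # text fragments of the current page

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1            # nested <div> inside a page
        elif "page" in (dict(attrs).get("class") or "").split():
            self._depth = 1             # a new page starts here

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:        # the page <div> just closed
                self.pages.append("".join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def split_pages(xhtml):
    """Return a list with the text of each page found in the XHTML."""
    splitter = PageSplitter()
    splitter.feed(xhtml)
    splitter.close()
    return splitter.pages


# Hypothetical stand-in for parser.from_file(path, xmlContent=True)["content"].
sample = (
    '<html><body>'
    '<div class="page"><p>First page text</p></div>'
    '<div class="page"><p>Second page text</p></div>'
    '</body></html>'
)
pages = split_pages(sample)
for number, text in enumerate(pages, start=1):
    print(number, text)
```

With real output, `sample` would be replaced by the `content` string returned for a PDF; if no page wrappers are present, `split_pages` returns an empty list.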
2
votes
1 answer

Increase OCR timeout in TIKA

In the newest Tika (2.5), the default OCR timeout is 300, which is not enough when multiple documents or images are OCR-processed in parallel. This leads to Tika OCR timeouts and therefore a Tika exception for the full document. I've tried adding the X-Tika-Timeout-Millis header but it…
Kate
2
votes
0 answers

I have extracted a PDF file using Python tika, but I want to extract the header and footer details. How can I get them?

import tika from tika import parser FileName = "sample.pdf" PDF_Parse = parser.from_file(FileName) print(PDF_Parse['content']) print(PDF_Parse['metadata']) But I want to extract the header and footer details. What should I do, using Python tika?
2
votes
1 answer

Increase tika heap size in Python with tika-python

Can someone suggest a way to give tika a larger heap size (1 GByte or so) while using tika-python (on Windows)? I get "status: 500" errors from tika when processing very large Microsoft Word files. If I run tika from the Windows command line as…
nerdfever.com
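A minimal sketch of one way to pass JVM options, assuming tika-python reads the `TIKA_JAVA_ARGS` environment variable when it launches its local server (as its README describes). The `set_tika_heap` helper and the file name in the comments are hypothetical. The variable has to be set before `tika` is imported, and a Tika server that is already running keeps its old heap until it is restarted.

```python
import os


def set_tika_heap(max_heap="1g"):
    """Set TIKA_JAVA_ARGS so the Tika server JVM is started with -Xmx<max_heap>."""
    os.environ["TIKA_JAVA_ARGS"] = f"-Xmx{max_heap}"
    return os.environ["TIKA_JAVA_ARGS"]


java_args = set_tika_heap("1g")
print(java_args)

# Import tika only after the variable is set (assumption: the value is read
# when the tika module is imported and used when the server process starts).
# from tika import parser
# parsed = parser.from_file("very_large.docx")  # hypothetical file name
```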
0
votes
1 answer

Can't parse IP address from PDF file; no error, just empty

I'm using Tika to parse IP addresses from a PDF file. Below is my code: import tika from tika import parser import re # Press the green button in the gutter to run the script. if __name__ == '__main__': tika.initVM() # opening pdf file …
Huy Than
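A minimal sketch of the matching step on its own, assuming the text has already been extracted (for example from `parsed["content"]`, which can be `None` when Tika finds no text layer, so an empty result may come from the extraction rather than the regex). The `find_ipv4` helper and the sample string are hypothetical, not the asker's code.

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits; range checks happen afterwards.
CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_ipv4(text):
    """Return the valid IPv4 addresses found in text, in order of appearance."""
    found = []
    for candidate in CANDIDATE.findall(text or ""):
        try:
            ipaddress.IPv4Address(candidate)   # rejects octets above 255
        except ValueError:
            continue
        found.append(candidate)
    return found


# Hypothetical sample standing in for the extracted PDF text.
sample = "Gateway 192.168.1.1, DNS 8.8.8.8, bogus 999.1.1.1"
addresses = find_ipv4(sample)
print(addresses)
```

Printing `repr(parsed["content"])` before matching is a quick way to check whether the PDF yielded any text at all.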
0
votes
0 answers

How can I extract text from an image in a pdf using the python port of Apache/Tika 2.6.0?

import tika from tika import parser import pytesseract from PIL import Image import numpy import scipy from tika import config tika.initVM() headers={'X-Tika-OCRLanguage': 'eng','X-Tika-PDFextractInlineImages': 'true','X-Tika-PDFOcrStrategy':…
ScottyCov
0
votes
0 answers

Extract text from a folder with many pdfs with python pandas and jupyter

I have multiple directories containing many PDF documents. What I would like to do is convert them to plain text with Python, all in one file, where I can search the text in the created .text file, with a second column holding the reference link to…
scofx
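A rough sketch of the directory walk, assuming a two-column layout of extracted text plus a `file://` link back to the source PDF. It uses the standard `csv` module rather than pandas, and the text extractor is passed in as a function so the snippet runs without Tika; with tika-python it could be something like `lambda p: parser.from_file(p)["content"]`. The helper names, column names, and demo files are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def collect_pdf_text(root, extract_text):
    """Return (text, link) rows for every *.pdf file found under root."""
    rows = []
    for pdf_path in sorted(Path(root).rglob("*.pdf")):
        text = extract_text(str(pdf_path)) or ""   # extractor may return None
        rows.append((text.strip(), pdf_path.resolve().as_uri()))
    return rows


def write_rows(rows, out_path):
    """Write the rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "link"])
        writer.writerows(rows)


# Demo on a temporary directory with a fake extractor instead of Tika.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.pdf").write_bytes(b"%PDF-1.4")
    (root / "sub" / "b.pdf").write_bytes(b"%PDF-1.4")
    (root / "notes.txt").write_text("ignored")
    rows = collect_pdf_text(root, lambda path: "text of " + Path(path).name)
    out_file = root / "all_text.csv"
    write_rows(rows, out_file)
    line_count = len(out_file.read_text(encoding="utf-8").splitlines())

print(rows)
```

The resulting CSV can be loaded with `pandas.read_csv` if a DataFrame is needed for searching.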
0
votes
1 answer

Latest Tesseract in Tika

The newest available version of Tesseract is 5.x, but the latest Tika is still using 4.x. Is it possible to upgrade the version of Tesseract OCR in Tika?
0
votes
0 answers

Running tika-python in a Docker container offline

I have a web app which uses tika-python. It works fine, and each time I start it, it downloads two files, "tika-server.jar" and "tika-server.jar.md5", to local storage and parses files. But sometimes it's unable to download those files, so this service doesn't work…
Garuda
0
votes
1 answer

How to get "Fast Web View" property value from pdf using python or any other source?

Is there a way to extract Fast Web View property value programmatically? Python would be preferred. Thanks Manohar
Manohar KM
0
votes
2 answers

How to deal with a large PDF?

I'm trying to extract text from a large PDF using this code (my file comes from a blob on Azure; the PDF is 7.3 MB, has 140 pages, and all of them are images), and it always reaches the timeout. os.environ['TIKA_SERVER_ENDPOINT'] =…
Tau n Ro
0
votes
1 answer

Tika server returned 500 status code when processing a pdf file

Code: dd = parser.from_file(r"file_path"). Line 554 in tika.py is resp = verbFn(serviceUrl, encodedData, **effectiveRequestOptions). The reason in resp was "INKApi Error". I am running the Tika server on my system.
shobhna
0
votes
1 answer

Tika server fails to start in Airflow (from the fourth simultaneous run) deployed on Kubernetes

I wanted to ask if any of you have encountered a similar error. I am working in a company where we use Airflow, deployed on Azure Kubernetes. We have a DAG in charge of extracting some information about different documents. Among many of the…
Tau n Ro
0
votes
2 answers

How to extract specific lines of text from multiple PDFs in a location and store them in Excel?

I have 100 PDFs stored in a location, and I want to extract text from them and store it in Excel. Below is an image of the PDF; from it (stored on page 1) I need: bid no, end date, item category, organisation name, OEM Average Turnover (Last 3 Years), Years of…
Deepak Jain
-1
votes
1 answer

Find multiple text strings in PDFs

I'm currently trying to pull PDFs that contain the following list of text strings. I was able to pull PDFs, but only with one word. Should I change my condition below? Thanks in advance; I'm a newbie here. from tika import parser import glob path =…
MFalcon
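A small sketch of a condition that checks several search terms instead of one, assuming the PDF text is already available as a string (for instance from `parser.from_file(...)["content"]`). The `contains_terms` helper, the term list, and the sample text are hypothetical, since the original condition is not shown in the excerpt.

```python
def contains_terms(text, terms, require_all=False):
    """Case-insensitive check for search terms in text.

    With require_all=False, one matching term is enough;
    with require_all=True, every term must be present.
    """
    haystack = (text or "").lower()          # content may be None
    hits = [term for term in terms if term.lower() in haystack]
    if require_all:
        return len(hits) == len(terms)
    return bool(hits)


# Hypothetical terms and text standing in for one parsed PDF.
terms = ["invoice", "refund"]
sample = "Invoice 42 issued"
any_match = contains_terms(sample, terms)
all_match = contains_terms(sample, terms, require_all=True)
print(any_match, all_match)
```

Applying this inside the loop over the `glob` results keeps every file that matches at least one term, or all of them when `require_all=True`.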