Highest Voted 'tika-python' Questions

3

votes

0 answers

How to read PDF/DOCX page by page using tika library in python?

`# #!/usr/bin/env python import tika tika.initVM() from tika import parser parsed = parser.from_file('frank_diary.docx') print(parsed["metadata"]) print(parsed["content"])` With this code I am able to read the whole file, but not page by page. Ref. I…

asked Jan 04 '23 at 08:07

dilip kukadiya

31
2
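One possible approach, sketched under assumptions: tika-python's `parser.from_file` accepts `xmlContent=True`, which returns XHTML instead of plain text, and for PDFs Tika's XHTML output typically wraps each page in `<div class="page">`. DOCX files do not store fixed page boundaries, so this is unlikely to help there. The helper names `split_pages` and `read_pdf_pages` are hypothetical.

```python
from html.parser import HTMLParser


class _PageSplitter(HTMLParser):
    """Collect the text found inside each <div class="page"> element."""

    def __init__(self):
        super().__init__()
        self.pages = []   # finished pages, in document order
        self._depth = 0   # <div> nesting depth inside the current page
        self._buf = []    # text fragments of the current page

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1          # nested <div> inside a page
        elif "page" in (dict(attrs).get("class") or "").split():
            self._depth = 1           # a new page starts here
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:      # the page <div> itself was closed
                self.pages.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def split_pages(xhtml):
    """Return one text string per <div class="page"> in the XHTML."""
    splitter = _PageSplitter()
    splitter.feed(xhtml or "")
    splitter.close()
    return splitter.pages


def read_pdf_pages(path):
    # Imported lazily: requires tika-python and a Java runtime.
    from tika import parser
    parsed = parser.from_file(path, xmlContent=True)
    return split_pages(parsed.get("content"))
```

If `read_pdf_pages` returns an empty list, the XHTML for that file probably contains no page markers, and a PDF-specific library may be a better fit.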

2

votes

1 answer

Increase OCR timeout in TIKA

In the newest Tika (2.5), the default OCR timeout is 300, which is not enough when multiple documents or images are OCR-processed in parallel. This leads to Tika OCR timeouts and therefore a Tika exception for the whole document. I've tried adding the X-Tika-Timeout-Millis header, but it…

tesseract apache-tika tika-server tika-python

asked Dec 01 '22 at 15:05

Kate

33
2

2

votes

0 answers

I have extracted a PDF file using python tika, but I want to extract header and footer details. How can I get them?

`import tika from tika import parser FileName = "sample.pdf" PDF_Parse = parser.from_file(FileName) print(PDF_Parse ['content']) print(PDF_Parse ['metadata'])` But I want to extract the header and footer details. How can I do that using python tika?

python-3.x pdf-scraping tika-python

asked Nov 30 '21 at 07:19

jothi prabu

21
1

2

votes

1 answer

Increase tika heap size in Python with tika-python

Can someone suggest a way to give tika a larger heap size (1 GByte or so) while using tika-python (on Windows)? I get "status: 500" errors from tika when processing very large Microsoft Word files. If I run tika from the Windows command line as…

python apache-tika tika-python

asked Oct 19 '21 at 21:03

nerdfever.com

1,652
1
20
41
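A minimal sketch of one option, assuming tika-python starts the Tika server itself: the tika-python README lists a `TIKA_JAVA_ARGS` environment variable for passing Java runtime arguments such as `-Xmx4g`. The helper names below are hypothetical, and the variable has to be set before `tika` is imported.

```python
import os


def configure_tika_heap(max_heap="1g"):
    """Ask tika-python to launch its Java server with a larger max heap."""
    os.environ["TIKA_JAVA_ARGS"] = f"-Xmx{max_heap}"
    return os.environ["TIKA_JAVA_ARGS"]


def parse_large_file(path, max_heap="1g"):
    configure_tika_heap(max_heap)
    # Import only after the variable is set: tika-python reads its
    # environment-based settings when the module is first imported.
    from tika import parser
    return parser.from_file(path)
```

A Tika server that is already running keeps its old heap size, so it likely needs to be stopped before the new setting can take effect.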

0

votes

1 answer

Can't parse IP addresses from a PDF file: no error, just empty output

I'm using Tika to parse IP addresses from a PDF file. Below is my code: import tika from tika import parser import re # Press the green button in the gutter to run the script. if __name__ == '__main__': tika.initVM() # opening pdf file …

python regex pdf tika-python

asked Mar 11 '23 at 04:11

Huy Than

1,538
2
16
31
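The excerpt is cut off, so the actual cause is unknown. As a hedged sketch, the following shows one way to pull IPv4 addresses out of extracted text while tolerating `None`, which tika-python can return as `content` when a PDF has no text layer. The names `IPV4_RE` and `find_ipv4` are hypothetical.

```python
import re

# Each octet is 0-255. The lookarounds keep a match from starting or
# ending in the middle of a longer run of digits and dots.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4_RE = re.compile(rf"(?<![\d.]){_OCTET}(?:\.{_OCTET}){{3}}(?!\.?\d)")


def find_ipv4(text):
    """Return every IPv4 address found in text (empty list for None)."""
    return IPV4_RE.findall(text or "")
```

With tika-python this could be called as `find_ipv4(parser.from_file(path)["content"])`. An empty result with no error may simply mean the extracted text is empty, which is worth printing and checking first.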

0

votes

0 answers

How can I extract text from an image in a pdf using the python port of Apache/Tika 2.6.0?

import tika from tika import parser import pytesseract from PIL import Image import numpy import scipy from tika import config tika.initVM() headers={'X-Tika-OCRLanguage': 'eng','X-Tika-PDFextractInlineImages': 'true','X-Tika-PDFOcrStrategy':…

python python-tesseract tika-python

asked Jan 31 '23 at 19:13

ScottyCov

21
5

0

votes

0 answers

Extract text from a folder with many pdfs with python pandas and jupyter

I have multiple directories containing many PDF documents. What I would like to do is convert them to plain text with Python, all in one file, so that I can search for the text in the created .text file, with a second column holding the reference link to…

python pandas jupyter-notebook tesseract tika-python

asked Jan 19 '23 at 11:08

scofx

149
12
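A rough sketch of the described workflow, using the standard-library `csv` module rather than pandas (the resulting file can still be loaded with `pandas.read_csv`). The function names and the two column names are hypothetical, and the extraction step is passed in as a callable so that the directory walk can be tried without a Tika server.

```python
import csv
from pathlib import Path


def collect_pdf_text(root, out_csv, extract):
    """Write one row per PDF under root: extracted text, then its path."""
    rows = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["text", "source"])
        # rglob descends into every sub-directory of root.
        for pdf in sorted(Path(root).rglob("*.pdf")):
            writer.writerow([extract(pdf) or "", pdf.as_posix()])
            rows += 1
    return rows


def tika_extract(pdf):
    # Imported lazily: requires tika-python and a Java runtime.
    from tika import parser
    return parser.from_file(str(pdf)).get("content")
```

A call such as `collect_pdf_text("docs", "all_text.csv", tika_extract)` would then produce a single searchable file with the source path in the second column.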

0

votes

1 answer

Latest Tesseract in Tika

The newest available version of Tesseract is 5.x, but the latest Tika is still using 4.x. Is it possible to upgrade the version of Tesseract OCR in Tika?

tesseract python-tesseract apache-tika tika-server tika-python

asked Sep 22 '22 at 13:19

Kate

33
2

0

votes

0 answers

running tika-python in docker container offline

I have a web app which uses tika-python. It works fine, and each time I start it, it downloads two files, "tika-server.jar" and "tika-server.jar.md5", to local storage and parses files. But sometimes it's unable to download those files, so this service doesn't work…

python docker apache-tika tika-python

asked Sep 05 '22 at 05:16

Garuda

46
7
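One option described in the tika-python README for disconnected environments is to ship the server jar (together with its `.md5` file) in the image and point the `TIKA_SERVER_JAR` environment variable at it with a `file://` URL. The sketch below assumes that behaviour; the helper name and the example path are hypothetical.

```python
import os
from pathlib import Path


def use_local_tika_jar(jar_path):
    """Point tika-python at a jar that is already on disk."""
    jar = Path(jar_path).resolve()                 # as_uri() needs an absolute path
    os.environ["TIKA_SERVER_JAR"] = jar.as_uri()   # e.g. file:///opt/tika/tika-server.jar
    return os.environ["TIKA_SERVER_JAR"]
```

As with other tika-python settings, this should run before `import tika`; in a Dockerfile the same effect can be had with an `ENV` line.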

0

votes

1 answer

How to get "Fast Web View" property value from pdf using python or any other source?

Is there a way to extract Fast Web View property value programmatically? Python would be preferred. Thanks Manohar

python pypdf pdfminer pymupdf tika-python

asked Aug 16 '22 at 16:50

Manohar KM

11
1
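"Fast Web View" corresponds to a linearized PDF, which carries a linearization parameter dictionary containing a `/Linearized` key near the start of the file. A heuristic sketch that needs no third-party library follows; the function names are hypothetical, and a file altered after linearization can still carry the marker, so treat the result as an indication only.

```python
import re

# The linearization dictionary looks like: << /Linearized 1 /L 12345 ... >>
_LINEARIZED = re.compile(rb"/Linearized\s+\d")


def is_linearized(data):
    """Heuristically check the first 1024 bytes for a /Linearized key."""
    head = data[:1024]
    return b"%PDF-" in head and bool(_LINEARIZED.search(head))


def is_fast_web_view(path):
    with open(path, "rb") as fh:
        return is_linearized(fh.read(1024))
```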

0

votes

2 answers

How to deal with large pdf?

I'm trying to extract text from a large PDF using this code (my file comes from a blob on Azure; the PDF is 7.3 MB, has 140 pages, and they are all images), and it always reaches the timeout. os.environ['TIKA_SERVER_ENDPOINT'] =…

python apache-tika tika-server tika-python

asked May 24 '22 at 15:54

Tau n Ro

108
8
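On the client side, tika-python forwards a `requestOptions` dictionary to the underlying HTTP request, so a longer `timeout` can be passed there. A minimal sketch follows; the helper names and the 600-second value are illustrative assumptions, and a server-side OCR or task timeout may still apply independently.

```python
def request_options(timeout_seconds=600):
    """Build the options dict that tika-python passes to the HTTP call."""
    return {"timeout": timeout_seconds}


def parse_with_timeout(path, timeout_seconds=600):
    # Imported lazily: requires tika-python and a reachable Tika server.
    from tika import parser
    return parser.from_file(path, requestOptions=request_options(timeout_seconds))
```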

0

votes

1 answer

Tika server returned 500 status code when processing a pdf file

Code: dd = parser.from_file(r"file_path"). The failing call is at line number 554 in tika.py: resp = verbFn(serviceUrl, encodedData, **effectiveRequestOptions). The reason in resp was INKApi Error. I am running the Tika server on my system.

apache-tika tika-server tika-python

asked May 23 '22 at 08:42

shobhna

13
6

0

votes

1 answer

Tika server fails to start in Airflow (from the fourth simultaneous run) deployed on Kubernetes

I wanted to ask if any of you have encountered a similar error. I am working in a company where we are using Airflow, deployed on Azure Kubernetes. We have a DAG in charge of extracting some information about different documents. Among many of the…

python airflow apache-tika tika-server tika-python

asked Mar 02 '22 at 09:42

Tau n Ro

108
8

0

votes

2 answers

How to extract text from multiple PDFs in a location at specific lines and store it in Excel?

I have 100 PDFs stored in a location, and I want to extract text from them and store it in Excel. Below is an image of the PDF. From it I want these fields (stored on page 1): bid no, end date, item category, organisation name needed OEM Average Turnover (Last 3 Years), Years of…

python pdf pypdf pdfminer tika-python

asked Feb 03 '22 at 11:15

Deepak Jain

137
1
3
27

-1

votes

1 answer

Find multiple text strings in PDFs

I'm currently trying to pull PDFs that contain the following list of text. I was able to pull PDFs, but only with one word. Should I change my condition below? Thanks in advance. Newbie here. from tika import parser import glob path =…

python tika-python

asked May 11 '22 at 11:19

MFalcon

23
4
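The condition itself is truncated in the excerpt, so the following is only a sketch of one way to test extracted text against several search terms at once. The function names are hypothetical, and the matching is a plain case-insensitive substring check.

```python
def missing_terms(text, terms):
    """Return the terms that do not occur in text (case-insensitive)."""
    lowered = (text or "").lower()
    return [t for t in terms if t.lower() not in lowered]


def contains_all(text, terms):
    """True only when every term occurs somewhere in text."""
    return not missing_terms(text, terms)
```

To keep a PDF when any single term matches instead of all of them, `any(t.lower() in text.lower() for t in terms)` would be the corresponding check.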

Questions tagged [tika-python]