Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
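
As a rough illustration of the RESTful route, the sketch below builds a PUT request against a Tika server from Python using only the standard library. It assumes a server is already running at http://localhost:9998 (the tika-server default port) and that its /tika endpoint returns the extracted text as UTF-8 plain text. The function names build_tika_request and extract_text are hypothetical, chosen only for this example.

```python
import urllib.request

# Assumption: a Tika server is already running locally on its default port.
TIKA_URL = "http://localhost:9998"


def build_tika_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking Tika for plain text."""
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(data: bytes, base_url: str = TIKA_URL) -> str:
    """Send the document bytes to the server and return the extracted text."""
    with urllib.request.urlopen(build_tika_request(data, base_url)) as resp:
        # Assumption: the server responds with UTF-8 encoded text.
        return resp.read().decode("utf-8")
```

Calling extract_text(open("example.pdf", "rb").read()) would then return the document text, provided the server is reachable; example.pdf is a placeholder file name.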

1283 questions

6 answers

Read Content from Files which are inside Zip file

I am trying to create a simple Java program which reads and extracts the content from the file(s) inside a zip file. The zip file contains 3 files (txt, pdf, docx). I need to read the contents of all these files and I am using Apache Tika for this…

java zip extract apache-tika

asked Mar 27 '13 at 18:54

S Jagdeesh

1 answer

How to add a custom MIME type and override a default extension pattern?

I am trying to add a custom MIME type to Apache Tika. I have the following custom-mimetypes.xml document in org.apache.tika.mime:

java mime apache-tika

asked Feb 22 '13 at 03:49

user177800

1 answer

How to determine appropriate file extension from MIME Type in Java

I am uploading files to an Amazon s3 bucket and have access to the InputStream and a String containing the MIME Type of the file but not the original file name. It's up to me to actually create the file name and extension before pushing the file up…

java amazon-s3 apache-tika

asked Nov 30 '12 at 17:44

rphutchinson

7 answers

Use tika with python, runtimeerror: unable to start tika server

I am trying to use the tika package to parse files. Tika is installed successfully, and I ran tika-server-1.18.jar from cmd with java -jar tika-server-1.18.jar. My code in Jupyter is: import tika from tika import parser parsed =…

python parsing apache-tika

asked Jul 25 '18 at 08:28

Sha Li

3 answers

How to use Tika in server mode

On Tika's website it says (concerning tika-app-1.2.jar) it can be used in server mode. Does anyone know how to send documents and receive parsed text from this server once it is running?

apache-tika

asked Sep 01 '12 at 21:39

Serge Anido

4 answers

How to get file extension from content type?

I'm using Apache Tika, and I have files (without extension) of a particular content type that need to be renamed to have an extension that reflects the content type. Any idea if there is something I could use instead of programming that from scratch based…

java content-type apache-tika

asked Apr 04 '11 at 16:48

lisak

3 answers

How to read large files using Tika?

I'm parsing large PDF and Word documents using Tika but I get the following error message: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your…

apache-tika

asked Jun 26 '15 at 18:02

HHH

6 answers

Indexing PDF with Solr

Can anyone point me to a tutorial? My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial that tells me what I need to do to index PDFs. I have seen this:…

solr full-text-search solrj apache-tika solr-cell

asked Jul 14 '11 at 13:57

Mark

1 answer

Apache Tika and character limit when parsing documents

Could anybody please help me sort this out? It can be done like this: Tika tika = new Tika(); tika.setMaxStringLength(10*1024*1024); But if you don't use Tika directly, like this: ContentHandler textHandler = new…

java text-processing apache-tika

asked May 26 '11 at 20:33

lisak

1 answer

How to extract text from a directory of PDF files efficiently with OCR?

I have a large directory with PDF files (images). How can I efficiently extract the text from all the files inside the directory? So far I tried: import multiprocessing import textract def extract_txt(file_path): text =…

python python-3.x parallel-processing tesseract apache-tika

asked Apr 28 '17 at 05:09

john doe

4 answers

Getting MimeType subtype with Apache tika

I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents like odt, ppt, pptx, xlsx, etc. If you look at mimetypes.xml there are mimeType elements composed of the iana.org mime-type and…

java mime-types detection apache-tika

asked Aug 21 '11 at 10:14

lisak

4 answers

java.lang.IllegalArgumentException: protocol = http host = null

For this link http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss this code doesn't work, but if I use another one, for example https://www.google.com, everything is OK: URL url = new…

java url apache-tika

asked Sep 03 '14 at 10:35

Goko Gorgiovski

2 answers

Elasticsearch Parse Exception error when attempting to index PDF

I'm just getting started with Elasticsearch. Our requirement has us needing to index thousands of PDF files and I'm having a hard time getting just ONE of them to index successfully. I installed the Attachment Type plugin and got the response: Installed…

pdf base64 elasticsearch apache-tika osx-server

asked Jun 13 '12 at 14:50

Meltemi

5 answers

python how to use tika with existing jar file without downloading again

I'm using Tika and I realized that the jar file is downloaded each time and placed in the Temp folder: Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to…

python apache-tika

asked Jun 12 '19 at 10:20

Michael Fish

2 answers

PDFBox adding white spaces within words

When I try to extract text from my PDF files, it seems to insert white spaces randomly within several words. I am using pdfbox-app-1.6.0.jar (latest version) on the following sample file in the Downloads section of this page…

solr lucene pdfbox apache-tika

asked Oct 31 '11 at 14:06

Ravish Bhagdev
