Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
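
As a rough illustration of the non-Java route, the sketch below sends a document to a Tika server over HTTP using only the Python standard library. It assumes a tika-server instance is already running locally on its default port 9998 and uses the server's `/tika` endpoint with a plain-text `Accept` header. The helper names `build_extract_request` and `extract_text` are hypothetical, chosen for this example.

```python
import urllib.request

# Assumption: tika-server is running locally on its default port (9998).
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(document: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking Tika for plain text."""
    return urllib.request.Request(
        url,
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(document: bytes, url: str = TIKA_URL) -> str:
    """Send the document to the server and return the extracted text."""
    with urllib.request.urlopen(build_extract_request(document, url)) as resp:
        return resp.read().decode("utf-8")
```

With a server running, `extract_text(open("report.pdf", "rb").read())` would return the document's text; `report.pdf` is a placeholder file name.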

1283 questions
93 votes
6 answers

Read Content from Files which are inside Zip file

I am trying to create a simple Java program which reads and extracts the content from the file(s) inside a zip file. The zip file contains 3 files (txt, pdf, docx). I need to read the contents of all these files and I am using Apache Tika for this…
S Jagdeesh
40 votes
1 answer

How to add a custom MIME type and override a default extension pattern?

I am trying to add a custom MIME type to Apache Tika. I have the following custom-mimetypes.xml document in org.apache.tika.mime:
user177800
40 votes
1 answer

How to determine appropriate file extension from MIME Type in Java

I am uploading files to an Amazon s3 bucket and have access to the InputStream and a String containing the MIME Type of the file but not the original file name. It's up to me to actually create the file name and extension before pushing the file up…
rphutchinson
29 votes
7 answers

Use tika with python, runtimeerror: unable to start tika server

I am trying to use the tika package to parse files. Tika is installed successfully, and I ran tika-server-1.18.jar from cmd with java -jar tika-server-1.18.jar. My code in Jupyter is: import tika from tika import parser parsed =…
Sha Li
29 votes
3 answers

How to use Tika in server mode

On Tika's website it says (concerning tika-app-1.2.jar) it can be used in server mode. Does anyone know how to send documents and receive parsed text from this server once it is running?
Serge Anido
26 votes
4 answers

How to get file extension from content type?

I'm using Apache Tika, and I have files (without an extension) of a particular content type that need to be renamed to have an extension that reflects the content type. Any idea if there is something I could use instead of programming that from scratch based…
lisak
20 votes
3 answers

How to read large files using Tika?

I'm parsing large PDF and Word documents using Tika but I get the following error message: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your…
HHH
18 votes
6 answers

Indexing PDF with Solr

Can anyone point me to a tutorial? My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial telling me what I need to do to index PDFs. I have seen this:…
Mark
18 votes
1 answer

Apache Tika and character limit when parsing documents

Could anybody please help me sort this out? It can be done like this: Tika tika = new Tika(); tika.setMaxStringLength(10*1024*1024); But if you don't use Tika directly, like this: ContentHandler textHandler = new…
lisak
17 votes
1 answer

How to extract text from a directory of PDF files efficiently with OCR?

I have a large directory of PDF files (images). How can I efficiently extract the text from all the files inside the directory? So far I tried: import multiprocessing import textract def extract_txt(file_path): text =…
john doe
16 votes
4 answers

Getting MimeType subtype with Apache tika

I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents like odt, ppt, pptx, xlsx, etc. If you look at mimetypes.xml there are mimeType elements composed of the iana.org mime-type and…
lisak
15 votes
4 answers

java.lang.IllegalArgumentException: protocol = http host = null

For this link http://bits.blogs.nytimes.com/2014/09/02/uber-banned-across-germany-by-frankfurt-court/?partner=rss&emc=rss this code doesn't work, but if I use another one, for example https://www.google.com, everything is OK: URL url = new…
Goko Gorgiovski
15 votes
2 answers

Elasticsearch Parse Exception error when attempting to index PDF

I'm just getting started with elasticsearch. Our requirement has us needing to index thousands of PDF files and I'm having a hard time getting just ONE of them to index successfully. Installed the Attachment Type plugin and got response: Installed…
Meltemi
14 votes
5 answers

python how to use tika with existing jar file without downloading again

I'm using Tika and I realized that each time the jar file is downloaded and placed in the Temp folder: Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to…
Michael Fish
13 votes
2 answers

PDFBox adding white spaces within words

When I try to extract text from my PDF files, it seems to insert white spaces between several words randomly. I am using pdfbox-app-1.6.0.jar (latest version) on the following sample file in the Downloads section of this page…
Ravish Bhagdev