The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.
For most common formats, Tika also provides content extraction, metadata extraction, and language identification.
Although Tika is written in Java, it is widely used from other languages: the RESTful server and the CLI tool give non-Java programs access to Tika's functionality.
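As an illustration of non-Java access, here is a minimal Python sketch that talks to the RESTful server using only the standard library. It assumes a Tika server is already running locally on its default port (9998) and uses the server's `/detect/stream`, `/tika`, and `/meta` endpoints. The helper names `build_request`, `call_tika`, and `describe` are hypothetical and exist only for this example.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port, 9998.
TIKA_URL = "http://localhost:9998"


def build_request(endpoint, data, accept):
    """Build (but do not send) a PUT request carrying the raw document bytes."""
    return urllib.request.Request(
        url=f"{TIKA_URL}/{endpoint}",
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


def call_tika(endpoint, data, accept="text/plain"):
    """Send the document to an endpoint and return the decoded response body."""
    with urllib.request.urlopen(build_request(endpoint, data, accept)) as resp:
        return resp.read().decode("utf-8")


def describe(path):
    """Ask the server for the MIME type, extracted text, and metadata of a file."""
    with open(path, "rb") as f:
        data = f.read()
    return {
        "mime_type": call_tika("detect/stream", data),
        "text": call_tika("tika", data),
        "metadata_json": call_tika("meta", data, accept="application/json"),
    }
```

Calling `describe("report.pdf")` (a placeholder file name) would return a dictionary holding the detected type, the plain-text content, and the metadata as a JSON string. Any network or HTTP error is left to propagate to the caller.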
Related Tags:
apache-tika, tika-server, mime-types, metadata, tesseract, language-detection, parsing