UIMA (Unstructured Information Management Architecture) is an architecture for creating scalable applications that analyze and extract information from unstructured data sources such as text, audio, and video. UIMA is specified in an OASIS standard. Apache UIMA is an open-source Java framework that implements the UIMA architecture and is based on code open-sourced by IBM. UIMA was a central part of Watson, IBM's Jeopardy!-playing computer. UIMA applications typically use natural language processing (NLP) techniques to perform analysis.
UIMA defines applications as Collection Processing Engines (CPEs). Each CPE includes a Collection Reader (CR), one or more Analysis Engines (AEs), and optionally a CAS Consumer.
A Collection is a repository of data to be analyzed, and it may take a number of forms, including RDBMS tables, a schema-less database, or a set of files on a filesystem. The first component in a CPE is the Collection Reader, which reads pieces of data from the Collection and packages each piece in a data structure called the Common Analysis Structure (CAS).
The CR passes CAS objects on to the first Analysis Engine in the pipeline. Each AE analyzes the information artifact packaged in a CAS, constructs annotations from the results of the analysis (e.g. parts of speech for words or phrases), and adds these annotations to the CAS before passing it on downstream. At the end of the pipeline, a CAS Consumer does something useful with the annotations, such as writing them to a database or to files, or adding them to a semantic search index. Since version 2 of UIMA, the Apache UIMA documentation has recommended using Analysis Engines instead of CAS Consumers, because AEs possess all of the functionality required for consuming CAS objects.
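The reader-to-consumer flow described above can be sketched in plain Java. This is a simplified model, not the Apache UIMA API: the classes Cas, Annotation, TokenAnnotator and CapitalizedAnnotator are hypothetical stand-ins invented for illustration, and the "collection" is just a list of strings.

```java
import java.util.ArrayList;
import java.util.List;

public class PipelineSketch {
    /** Hypothetical stand-in for a CAS: the artifact text plus its annotations. */
    static final class Cas {
        final String text;
        final List<Annotation> annotations = new ArrayList<>();
        Cas(String text) { this.text = text; }
    }

    /** A typed span over the artifact text. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;
        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Each engine reads the CAS and adds annotations to it. */
    interface AnalysisEngine {
        void process(Cas cas);
    }

    /** Hypothetical AE: annotates every whitespace-delimited token. */
    static final class TokenAnnotator implements AnalysisEngine {
        public void process(Cas cas) {
            int i = 0;
            int n = cas.text.length();
            while (i < n) {
                while (i < n && Character.isWhitespace(cas.text.charAt(i))) i++;
                int begin = i;
                while (i < n && !Character.isWhitespace(cas.text.charAt(i))) i++;
                if (i > begin) cas.annotations.add(new Annotation("Token", begin, i));
            }
        }
    }

    /** Hypothetical AE: builds on upstream Token annotations. */
    static final class CapitalizedAnnotator implements AnalysisEngine {
        public void process(Cas cas) {
            List<Annotation> found = new ArrayList<>();
            for (Annotation a : cas.annotations) {
                if (a.type.equals("Token") && Character.isUpperCase(cas.text.charAt(a.begin))) {
                    found.add(new Annotation("Capitalized", a.begin, a.end));
                }
            }
            cas.annotations.addAll(found);
        }
    }

    /** Reads each document into a CAS, runs the engines in order, then consumes the results. */
    static List<String> run(List<String> collection) {
        List<AnalysisEngine> engines = List.of(new TokenAnnotator(), new CapitalizedAnnotator());
        List<String> sink = new ArrayList<>();
        for (String document : collection) {          // collection reader step
            Cas cas = new Cas(document);
            for (AnalysisEngine engine : engines) {   // analysis steps, in pipeline order
                engine.process(cas);
            }
            for (Annotation a : cas.annotations) {    // consumer step
                if (a.type.equals("Capitalized")) sink.add(cas.text.substring(a.begin, a.end));
            }
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Watson played Jeopardy", "no capitals here")));
    }
}
```

Note that the second engine depends on annotations left in the CAS by the first, which is the reason the CAS is handed along the pipeline rather than each engine starting from the raw text.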
Each UIMA component has a descriptor in XML that defines its behavior and parameters. The descriptor for a Collection Processing Engine refers to the descriptors of each of its components and overrides their settings if desired.
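The override behavior can be pictured as a merge in which CPE-level settings take precedence over a component's own defaults. The sketch below models only that precedence rule in plain Java; it does not parse real descriptors, and the parameter names (language, minTokenLength) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class DescriptorOverrideSketch {
    /** Merges a component's default settings with CPE-level overrides; overrides win. */
    static Map<String, String> effectiveSettings(Map<String, String> componentDefaults,
                                                 Map<String, String> cpeOverrides) {
        Map<String, String> merged = new HashMap<>(componentDefaults);
        merged.putAll(cpeOverrides);
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical parameters declared in a component descriptor.
        Map<String, String> defaults = Map.of("language", "en", "minTokenLength", "3");
        // Hypothetical override supplied by the enclosing CPE descriptor.
        Map<String, String> overrides = Map.of("minTokenLength", "5");
        System.out.println(effectiveSettings(defaults, overrides));
    }
}
```

Settings that the CPE does not mention keep the values from the component's own descriptor.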
UIMA supports conditional flow control, so that an annotation made in a CAS can determine which branch of the pipeline that CAS takes downstream.
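As a rough illustration of annotation-driven routing, the sketch below has a first step tag each CAS and a routing step pick the downstream branch from that tag. It is a plain-Java model under stated assumptions, not the UIMA flow controller API: the Cas class, the string-valued annotations, and the crude language check are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class FlowControlSketch {
    /** Hypothetical stand-in for a CAS, with annotations reduced to plain strings. */
    static final class Cas {
        final String text;
        final List<String> annotations = new ArrayList<>();
        Cas(String text) { this.text = text; }
    }

    /** Upstream step: tags the CAS with a deliberately crude language guess. */
    static void detectLanguage(Cas cas) {
        cas.annotations.add(cas.text.contains(" the ") ? "lang:en" : "lang:other");
    }

    static void englishBranch(Cas cas) {
        cas.annotations.add("processed-by:english-branch");
    }

    static void fallbackBranch(Cas cas) {
        cas.annotations.add("processed-by:fallback-branch");
    }

    /** Routes the CAS according to an annotation added earlier in the pipeline. */
    static Cas process(String text) {
        Cas cas = new Cas(text);
        detectLanguage(cas);
        if (cas.annotations.contains("lang:en")) {
            englishBranch(cas);
        } else {
            fallbackBranch(cas);
        }
        return cas;
    }

    public static void main(String[] args) {
        System.out.println(process("read the book").annotations);
        System.out.println(process("lies das Buch").annotations);
    }
}
```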
UIMA Asynchronous Scaleout is an add-on that enables a UIMA application to run many instances of an Analysis Engine to support higher throughput.
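The general idea of running several engine instances side by side can be sketched with a standard Java thread pool. This is only an analogy for the throughput benefit and does not use or represent the UIMA Asynchronous Scaleout API; the TokenCounter engine and the analyzeAll method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScaleoutSketch {
    /** Hypothetical engine: counts whitespace-delimited tokens in a document. */
    static final class TokenCounter {
        int process(String text) {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    /** Analyzes documents concurrently, using one engine instance per worker thread. */
    static List<Integer> analyzeAll(List<String> documents, int instances) {
        ExecutorService pool = Executors.newFixedThreadPool(instances);
        // Each worker thread lazily creates and then reuses its own engine instance.
        ThreadLocal<TokenCounter> engine = ThreadLocal.withInitial(TokenCounter::new);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String document : documents) {
                pending.add(pool.submit(() -> engine.get().process(document)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pending) {
                results.add(future.get());   // results are collected in submission order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(analyzeAll(List.of("a b c", "one", ""), 2));
    }
}
```

Giving each worker its own instance avoids sharing mutable engine state between threads, at the cost of initializing the engine once per worker.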