Apache Tika is the best choice. Apache has recently extracted many sub-projects out of the existing projects and made them public. Tika is one of them that was previously a component of Apache Lucene. Because of Apache's support and reputation and the widely-used parent project Lucene it must be a very good choice. Furthermore, it is open-source.
A brief introduction from Apache Tika web site:
The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
And the supported formats are:
HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format