I've found a few related solutions to this problem. The related solutions will not work for me as I'll explain. (I'm using Solr 4.0 and indexing data stored in an Oracle 11g database.)
Jonck van der Kogel's related solution (from 2009) is explained here. He describes creating a custom Transformer, sort of like the ClobTransformer that ships with Solr. This is going down the elegant path but is not using Tika which is now integrated with Solr. (He uses external PDFBox and FontBox.) This creates multiple maintenance / upgrade dependencies. Also, I need to be able to index Word documents in addition to PDF.
Since Kogel's solutions seems to be on the right path, is there a way to use the Tika classes included with Solr in a custom Transformer? That would allow all the Tika functionality with Kogel's elegant database solution.
Another related solution is the ExtractingRequestHandler (ERH) that ships with Solr. However, as the name suggests, this is a request handler, such as to handle HTTP posts of rich-text documents. To extract documents from the database this way has performance and security problems. I would have to make the database BLOBs accessible via HTTP. I've found no discussion of using ERH for direct ingest from a database BLOB. Is it possible to directly ingest from database BLOBs with Solr Cell?
Another related solution is to write a Transformer (like Kogel's above) to convert a byte[] to a string (from DataImportHandler FAQ). With true binary documents this is going to feed junk into the index and not properly extract the text elements like Tika does. Won't work.
A final related solution is UpdateRichDocuments offered by the RichDocumentHandler. This is deprecated and no longer available in Solr. The page refers you to the ExtractingRequestHandler (discussed above).
It seems like the right solution is to use DataImportHandler and a customer Transformer using the Tika class. How does this work?