
I have a DIH (DataImportHandler) configuration that combines data from a database and Tika by passing the filename from the database to Tika. The problem is that the filename arrives empty in Tika. The log says:

ERROR (Thread-16) [   ] o.a.s.h.d.DataImporter Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file:  (resolved to: C:\Users\jimbo\Desktop\solr-8.9.0\server\.

This is my configuration XML file:

<dataConfig>
    <dataSource name="ds-db" driver="org.mariadb.jdbc.Driver" url="jdbc:mysql://localhost:3306/eepyakm?user=root" user="root" password="wpadmin"/>
    <dataSource name="ds-file" type="BinFileDataSource"/>
    <document>
        <entity name="supplier" query="select * from suppliers_tmp_view" dataSource="ds-db" 
                deltaQuery="select id from suppliers_tmp_view where last_modified > '${dataimporter.last_index_time}'"
                deltaImportQuery="select * from suppliers_tmp_view where id='${dataimporter.delta.id}'">
             
            <entity name="attachment" dataSource="ds-db" 
                    query="select * from suppliers_tmp_files_view where supplier_tmp_id='${supplier.id}'"
                    deltaQuery="select id,supplier_tmp_id from suppliers_tmp_files_view where last_modified > '${dataimporter.last_index_time}'"
                    parentDeltaQuery="select id from suppliers_tmp_view where id='${attachment.supplier_tmp_id}'">
            
                <field name="path" column="path"/>
                
                <entity name="file" processor="TikaEntityProcessor" url="${attachment.path}" format="text" dataSource="ds-file">
                    
                    <field column="text"/>
                </entity>
            </entity>
        </entity>
    </document>
</dataConfig>

I found a similar problem in a very old post: Solr's TikaEntityProcessor not working

jim
  • OK, the configuration is correct. However, not all of my database records have a value in the "path" column. It looks as if the file data source is broken, but in reality Tika is simply not finding a file at a null path: ${attachment.path} resolves to an empty string, so it falls back to a default path, which of course fails. Restricting the queries to rows where "path" is not null should keep Tika happy. – jim Aug 19 '21 at 09:10
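
The fix described in the comment above could look roughly like this. It is only a sketch: it assumes the "path" column lives in suppliers_tmp_files_view, that empty strings should be excluded along with nulls, and that the rest of the configuration stays as posted. Only the query on the "attachment" entity changes.

```xml
<!-- Sketch only: the "attachment" entity from the question, with the query
     narrowed so that rows without a usable path never reach Tika.
     The extra "path is not null and path != ''" condition is an assumption
     about how missing paths are stored in the view. -->
<entity name="attachment" dataSource="ds-db"
        query="select * from suppliers_tmp_files_view where supplier_tmp_id='${supplier.id}' and path is not null and path != ''"
        deltaQuery="select id,supplier_tmp_id from suppliers_tmp_files_view where last_modified > '${dataimporter.last_index_time}'"
        parentDeltaQuery="select id from suppliers_tmp_view where id='${attachment.supplier_tmp_id}'">

    <field name="path" column="path"/>

    <!-- Unchanged: Tika now only receives rows that carry a path. -->
    <entity name="file" processor="TikaEntityProcessor" url="${attachment.path}" format="text" dataSource="ds-file">
        <field column="text"/>
    </entity>
</entity>
```

With this filter, suppliers that have no attachment path are still indexed from the outer "supplier" entity; they just produce no "file" sub-entity. If some paths point to files that no longer exist on disk, setting onError="skip" or onError="continue" on the "file" entity may also be worth trying, though that was not tested here.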
