Hi fellow SOLR developers,
I have some PDF files that contain circuit diagrams. Some of the text is written vertically over the circuits. For instance, the label "junction connector" is printed vertically along a stretch of circuit in the PDF, and when indexed into SOLR it becomes "j u n c t i o n C o n n e c t o r".
Searches on those keywords find nothing, for obvious reasons. Is it possible to change the underlying processor?
I tried converting the PDF to text using 'itextpdf' in a standalone Java class, and 'itextpdf' prints the text decently enough. When I read the same PDF using 'Apache Tika', I see a lot of words broken up with spaces, similar to what SOLR produces, obviously.
Is it even possible to develop and integrate an 'itextpdf' entity processor, for instance? Or any other custom entity processor?
My fallback is to read the PDF and index it myself using solrj, but that is my worst-case alternative because of environment and design constraints.
Using SOLR 5.3.1
I'm using the Tika processor right now:
<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="tika-test" processor="TikaEntityProcessor"
            url="C:\Users\12345\Downloads\workspace\Playground\circuits.pdf" format="text">
      <field column="Author" name="creator" meta="true"/>
      <field column="title" name="producer" meta="true"/>
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>
This is how SOLR indexes the document:
P o w e r Sou rc e T h e ft D e te rre n t a n d W ire le s s D o o r L o c k C o n tro l Turn Signal Flasher <6 –5 > DHEJ T–O V–R DJF C ombination M eter
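To make the custom entity processor question concrete, this is roughly the configuration I am hoping would work. It is only a sketch: com.example.ITextEntityProcessor is a hypothetical class that I have not written, and I am assuming the processor attribute would accept a fully qualified class name for a custom processor. The data source, url, and text field are the same as in my current setup.

```xml
<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <!-- Hypothetical: com.example.ITextEntityProcessor does not exist yet.
         It stands in for a custom processor that would extract the text
         with itextpdf instead of Tika. -->
    <entity name="itext-test" processor="com.example.ITextEntityProcessor"
            url="C:\Users\12345\Downloads\workspace\Playground\circuits.pdf">
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>
```

If something like this is possible, I could stay within the data import setup I already have and avoid the solrj route.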