
Hi fellow Solr developers,

I have some PDF files containing circuit diagrams, with some text written vertically over the circuits. For instance, there is a phrase "junction connector" marked vertically in the PDF, over a circuit stretch, which when indexed into Solr becomes "j u n c t i o n C o n n e c t o r".

Searching for the original keywords fails, for obvious reasons. Is it possible to change the underlying processor?

I tried converting the PDF to text using 'itextpdf' in a standalone Java class, and 'itextpdf' prints the text decently. When I read the same PDF using 'Apache Tika', I see a lot of words broken up with spaces, similar to what Solr does, which is no surprise since Solr's extraction goes through Tika.
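A minimal sketch of that standalone extraction, assuming iText 5.x (the path and class name are just examples):

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class ExtractWithIText {
    public static void main(String[] args) throws Exception {
        // getTextFromPage defaults to a location-based strategy that orders
        // text by position on the page, which is presumably why the vertical
        // labels come out readable here.
        PdfReader reader = new PdfReader("circuits.pdf"); // example path
        StringBuilder text = new StringBuilder();
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            text.append(PdfTextExtractor.getTextFromPage(reader, page)).append('\n');
        }
        reader.close();
        System.out.println(text);
    }
}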

Is it even possible to develop and integrate an 'itextpdf' entity processor, for instance, or any other custom entity processor?

My worst-case alternative is to read the PDFs and index them with SolrJ, but as mentioned, that is going to be my last resort because of environment and design constraints.

Using Solr 5.3.1.

I'm using the Tika processor right now:

<dataConfig>
<dataSource type="BinFileDataSource" />
<document>
    <entity name="tika-test" processor="TikaEntityProcessor"
            url="C:\Users\12345\Downloads\workspace\Playground\circuits.pdf" format="text">
            <field column="Author" name="creator" meta="true"/>
            <field column="title" name="producer" meta="true"/>
            <field column="text" name="text"/>
    </entity>
</document>
</dataConfig>

The way Solr indexes the documents looks like this:

P o w e r Sou rc e T h e ft D e te rre n t a n d W ire le s s D o o r L o c k C o n tro l Turn Signal Flasher <6 –5 > DHEJ T–O V–R DJF C ombination M eter

aswath86
  • So you're saying that iText does a better job than the text extractor you're currently using. Then what's the problem? Why don't you use the text that is extracted using iText? Can't you feed plain text to SOLR? – Bruno Lowagie Oct 23 '15 at 00:01
  • Thank you!! Edited the question with a few more details. I can, but on a larger scale I'm looking for a custom entity processor, if that's even possible. – aswath86 Oct 23 '15 at 00:05
  • OK, the question has improved (and deserves an up vote). I don't know the answer, but it would surprise me if you couldn't parse PDF to text with iText and hand the text over to a tool that indexes it. I'm really interested in the answer to this question too. – Bruno Lowagie Oct 23 '15 at 00:29
  • Thank you!! Yes, because of environment and design constraints, I'm looking for a custom entity processor. – aswath86 Oct 23 '15 at 16:52

1 Answer


The easiest (and not really the worst-case alternative) way would be to write a small itextpdf submission module yourself that scans a directory and uses SolrJ to submit the extracted text to Solr. This will also allow for easier customization and parallelization of the indexing process in the future (running the extraction and indexing process on more than one server).
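A minimal sketch of such a module, assuming SolrJ and iText 5.x on the classpath (the core URL and the field names are placeholders; adjust them to your schema):

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class PdfSubmitter {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL - point this at your own collection.
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/circuits");
        for (File pdf : new File(args[0]).listFiles((dir, name) -> name.endsWith(".pdf"))) {
            // Extract the text with iText instead of Tika.
            PdfReader reader = new PdfReader(pdf.getAbsolutePath());
            StringBuilder text = new StringBuilder();
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                text.append(PdfTextExtractor.getTextFromPage(reader, page)).append('\n');
            }
            reader.close();

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", pdf.getName());     // placeholder id scheme
            doc.addField("text", text.toString()); // placeholder field name
            client.add(doc);
        }
        client.commit();
        client.close();
    }
}

Parallelizing is then just a matter of splitting the directory between several instances of this tool.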

The Tika extract handler will probably be moved out of the Solr core and into a separate indexing tool at some point in the future anyway.

It would be possible to write a separate daemon that you could submit documents to and that would support different indexing strategies in the future, but no work has been done on that yet.

MatsLindh
  • Thank you for responding. It's a worst case for 'me' actually, not a worst case as such :) The environment here is complicated; hence, if there is a way to deploy a custom entity processor, that will be my absolute solution. – aswath86 Oct 23 '15 at 16:47
  • Start with [the source from TikaEntityProcessor](https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/dataimporthandler-extras/src/java/org/apache/solr/handler/dataimport/TikaEntityProcessor.java) and modify it to use iTextPDF instead (see the sketch after this list). The class isn't very large, so as long as you're decently familiar with Java, it should be pretty straightforward. Remember that you'll have to load the resulting jar file into Solr as you do with the extras. I usually do a trunk checkout of Solr + Lucene and develop against that before creating the end jar. But keep in mind that DIH might disappear. – MatsLindh Oct 23 '15 at 21:01
  • Thank you for the pointer. That is what I'm hoping to do. I upvoted. But after some intense testing, iTextPDF renders some other circuit PDFs even worse. It's like I need to pick the least-bad PDF reader. I'm still studying a few other PDF parsers to see if they can read my PDFs well enough. – aswath86 Oct 23 '15 at 21:20
  • I think you might be seeing an issue where the actual layout (and structure) of the PDF makes it hard to get a decent human-readable format. We've had decent success by running OCR with different rotations, depending on the quality of the scans. You could also do a lot of processing on the text and boost higher-quality results (for example, collapsing the spaces between runs of one-letter words, etc.). If you can extract the coordinates of the text as well (pdftk under Linux can do this - we use it to extract text from ads), you can use the coordinates to merge text that's close together. – MatsLindh Oct 23 '15 at 21:23
  • Yes. The fundamental issue is with the PDF layout, but I have no say in it. Hence I'm trying to do something that is within Solr's reach. Thank you for helping! – aswath86 Oct 26 '15 at 21:35
  • I tried indexing the same set of PDFs in GSA and in Autonomy IDOL. GSA suffers from the same problem as Solr. However, Autonomy IDOL 7 was able to index the PDFs neatly and make most words searchable. I'm trying to find out the underlying parser used by Autonomy IDOL. – aswath86 Oct 26 '15 at 21:38
  • Are any of these PDFs public? – MatsLindh Oct 26 '15 at 22:36
  • No. Unfortunately these are sensitive documents. I would really like to share them with this community for a wider investigation, but sadly I shouldn't. Thank you for asking! – aswath86 Oct 26 '15 at 23:35
  • @MatsLindh you have mentioned the coordinates of text in the PDF; is it possible to extract them and show them in Solr (like a property of the text)? – zygimantus Jan 19 '17 at 11:21
  • @zygimantus It would require writing some custom code for indexing, but yes, you should be able to do that. PDF libraries are usually able to extract the coordinates where the text is placed, and depending on your use case, you could embed them into the text itself, or use a separate field that has the text for each token separated by whitespace. – MatsLindh Jan 19 '17 at 11:24
  • Thank you @MatsLindh. The unclear part for me is how to get the coordinates out of Apache Tika and then pass them to Solr. Probably I should be looking at writing a custom ExtractingRequestHandler? – zygimantus Jan 19 '17 at 11:36
  • @zygimantus You'll have to either extend the request handler yourself, or write a small tool that extracts the information from the PDF (by using Tika, for example - or a different PDF library) and then submits that information to Solr using JSON as a regular Solr document. – MatsLindh Jan 19 '17 at 12:09
  • It is clear now. I thought that I need to submit PDFs directly to solr using this command `bin/post -c gettingstarted samples/*.pdf`, but as you stated I can just extract information separately and only then submit. – zygimantus Jan 19 '17 at 12:20
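As referenced above, a rough, untested sketch of such a custom entity processor, assuming the Solr 5.x DIH API and iText 5.x (the class name ITextEntityProcessor is made up; the "text" column matches the data-config in the question):

import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataImportHandlerException;
import org.apache.solr.handler.dataimport.DataSource;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

// Hypothetical drop-in replacement for TikaEntityProcessor that extracts
// the text with iText instead of Tika.
public class ITextEntityProcessor extends EntityProcessorBase {

    private boolean done = false;

    @Override
    protected void firstInit(Context context) {
        super.firstInit(context);
        done = false;
    }

    @Override
    public Map<String, Object> nextRow() {
        if (done) return null; // emit a single row per entity, like TikaEntityProcessor
        done = true;
        String url = context.getResolvedEntityAttribute("url");
        try {
            @SuppressWarnings("unchecked")
            DataSource<InputStream> dataSource = context.getDataSource();
            InputStream is = dataSource.getData(url);
            PdfReader reader = new PdfReader(is);
            StringBuilder text = new StringBuilder();
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                text.append(PdfTextExtractor.getTextFromPage(reader, page)).append('\n');
            }
            reader.close();
            Map<String, Object> row = new HashMap<>();
            row.put("text", text.toString()); // maps to <field column="text" name="text"/>
            return row;
        } catch (Exception e) {
            throw new DataImportHandlerException(DataImportHandlerException.SEVERE,
                    "iText extraction failed for " + url, e);
        }
    }
}

Package the resulting jar (together with the iText jar) where Solr loads the dataimporthandler extras from, and change processor="TikaEntityProcessor" to processor="ITextEntityProcessor" in the data-config.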