Alternative to Tika/PDFBox for parsing PDF in Solr (any version later than 1.4)

Question

Solr does not seem to be parsing my PDF files correctly: I get random spaces in the middle of the extracted content. Is there an alternative to Apache Tika (which I believe uses PDFBox internally) for parsing PDF files? I have isolated the problem by running the PDFs through PDFBox directly (latest version), which shows the same problem.

Some commercial OCR software such as Omnifind handles these PDFs fine, but we cannot integrate it with Solr in the same way, and buying it is not an option either.

I've tried with 0.10, I think 1.0 just came out, haven't tried that yet. Will give it a shot tomorrow! Thanks. — Ravish Bhagdev, Nov 16 '11 at 23:39
The PDFBox team are actively working on the project, and each new release tends to improve things, so it's worth trying a newer Tika+PDFBox to see if it helps — Gagravarr, Nov 17 '11 at 09:45
Thanks for that. Yes, I tried with latest version of Tika 1.0 which I believe also uses latest version of PDFBox, it did improve things visibly when I used the new parameter they have added for turning off the auto spacing. However, still not quite perfect on documents I am trying on. — Ravish Bhagdev, Nov 17 '11 at 15:57

score 2 · Accepted Answer · edited May 23 '17 at 12:09

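As the answer to this SO question indicates, this is due to the nature of the PDF format itself: a PDF stores text as positioned glyphs rather than as a stream of words, so an extractor has to infer where the spaces belong.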

It's possible that OCR options do better on this problem than PDFBox. There are some free OCR options available, like Tesseract and Ocropus, but I have no idea how well they work or whether they can be easily integrated with Solr.

answered Nov 16 '11 at 11:00 by Tom De Leu (edited May 23 '17 at 12:09 by Community)

Thanks, I understand, but I am just trying out and looking for alternatives so I can list which ones work best on what kind of documents. I have not been looking for a perfect solution since I read that reply :) – Ravish Bhagdev Nov 16 '11 at 23:41
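
One rough way to compare extractors on the "random spaces" symptom is to count how many tokens in the output are a single letter. The sketch below is only a heuristic under that assumption: the function names `single_char_token_ratio` and `rank_extractions` are made up for illustration, and stray spaces that split words into longer fragments will not be caught.

```python
def single_char_token_ratio(text):
    """Fraction of whitespace-separated tokens that are a single letter."""
    tokens = text.split()
    if not tokens:
        return 0.0
    singles = sum(1 for t in tokens if len(t) == 1 and t.isalpha())
    return singles / len(tokens)


def rank_extractions(outputs):
    """Sort extractor names by ratio, lowest (likely cleanest) first.

    `outputs` maps a hypothetical extractor name to the text it produced
    for the same PDF.
    """
    return sorted(outputs, key=lambda name: single_char_token_ratio(outputs[name]))
```

Running every candidate extractor over the same sample of documents and ranking the outputs this way gives a first impression of which tool suits which kind of document; the results still need a manual check.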

score 1 · Answer 2 · answered Nov 16 '11 at 15:02


Xpdf contains pdftotext, which converts documents a lot better than Tika.
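
A minimal sketch of calling pdftotext from a script, assuming the `pdftotext` binary is installed and on the PATH. The helper names `build_pdftotext_command` and `pdf_to_text` are illustrative and not part of Xpdf. Passing `-` as the output file writes the text to stdout, and `-layout` asks pdftotext to keep the physical page layout, which may or may not help with spacing on a given document.

```python
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # keep the original physical layout
    cmd += [pdf_path, "-"]     # "-" as the text file means stdout
    return cmd


def pdf_to_text(pdf_path, layout=False, timeout=60):
    """Run pdftotext and return the extracted text.

    Raises CalledProcessError on a non-zero exit status and
    TimeoutExpired if the tool does not finish within `timeout` seconds.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string could then be sent to Solr as an ordinary text field instead of going through the Tika-based extraction.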

answered Nov 16 '11 at 15:02 by Okke Klein

Can you go into more detail about what you mean by "a lot better"? – gondo Oct 08 '13 at 05:00

score 1 · Answer 3 · answered Nov 16 '11 at 15:05


I use jPod as a fallback library to extract text from PDFs when PDFBox fails completely (hang, crash...), so at least in some cases it works better than PDFBox for me.
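
The fallback idea can be sketched independently of any particular library. In the sketch below, `extract_with_fallback` and the extractor callables are hypothetical names: each extractor is assumed to be a function that takes a file path and returns text, for example a wrapper around PDFBox, jPod, or a command-line tool.

```python
def extract_with_fallback(pdf_path, extractors):
    """Try each (name, extractor) pair in order.

    Returns (name, text) from the first extractor that produces
    non-empty text; raises RuntimeError if all of them fail.
    """
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception as exc:  # a crash in one library should not stop the chain
            errors.append(f"{name}: {exc!r}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

This only covers crashes and empty output. Guarding against hangs would need each extractor to run in a separate process with a timeout, since a stuck call inside the same process cannot be interrupted reliably.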

answered Nov 16 '11 at 15:05 by Persimmonium
