Indexing Word Documents and PDFs with Sphinx

Question

I have a website where users upload documents in .doc and .pdf format. I am using Sphinx to conduct full text searches on my SQL database (MySQL). What is the best way to index these file formats with Sphinx?

score 9 · Answer 1 · answered Apr 02 '11 at 22:01


I use pdftotext (for PDFs) and antiword (for Word documents) to dump the text of each uploaded file into the database. From there it's easy to index with Sphinx.
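A minimal sketch of that extraction step, assuming the PDF tool is Poppler's `pdftotext` command and that both it and `antiword` are on the PATH. The `EXTRACTORS` table and the `build_command` and `extract_text` names are hypothetical, chosen only for illustration; the returned text would then be written to a text column that the Sphinx source query selects.

```python
import subprocess
from pathlib import Path

# Assumed command lines; adjust to whichever tools are installed on the server.
EXTRACTORS = {
    # "-" as the output file makes pdftotext write to stdout.
    ".pdf": lambda p: ["pdftotext", "-layout", p, "-"],
    # antiword prints the document text to stdout by default.
    ".doc": lambda p: ["antiword", p],
}


def build_command(path):
    """Return the extractor command for a file, chosen by its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix](str(path))
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}")


def extract_text(path):
    """Run the extractor and return the document's plain text."""
    result = subprocess.run(build_command(path), capture_output=True, check=True)
    # Undecodable bytes are replaced rather than aborting the whole import.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text` once per upload (for example right after the file is saved) keeps the database copy in step with the files, so Sphinx only ever has to read plain text.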


mlissner


score 6 · Accepted Answer · answered Jul 30 '09 at 21:16


Unfortunately, Sphinx can't index those file types directly. You'll need to extract the textual contents first and then either import them into a database or write them out in an XML format that Sphinx understands (xmlpipe2).
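For the XML route, here is a rough sketch of producing an xmlpipe2 stream from text that has already been extracted. The `to_xmlpipe2` function and the `title` and `content` field names are assumptions made for this example, not anything Sphinx requires; only the `sphinx:docset`, `sphinx:schema`, `sphinx:field` and `sphinx:document` elements come from the xmlpipe2 format itself.

```python
import re
from xml.sax.saxutils import escape

# Control characters that are not allowed in XML 1.0 (extracted PDF text
# often contains form feeds between pages).
_INVALID_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def _clean(text):
    """Replace illegal control characters, then escape &, < and >."""
    return escape(_INVALID_XML.sub(" ", text))


def to_xmlpipe2(docs):
    """Build an xmlpipe2 document set from (doc_id, title, content) tuples."""
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="title"/>',    # hypothetical field name
        '<sphinx:field name="content"/>',  # hypothetical field name
        "</sphinx:schema>",
    ]
    for doc_id, title, content in docs:
        # Document ids must be unique positive integers.
        lines.append(f'<sphinx:document id="{int(doc_id)}">')
        lines.append(f"<title>{_clean(title)}</title>")
        lines.append(f"<content>{_clean(content)}</content>")
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines) + "\n"
```

A script that prints this output could then be named in `xmlpipe_command` inside a source block with `type = xmlpipe2`, so the indexer reads the documents from the script instead of from MySQL.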


pat


Would you recommend one method over another? – Jared Brown Jul 31 '09 at 04:04
It depends on which server-side language you're using. If it's Ruby/Rails, as far as I know none of the libraries support XML out of the box, so unless you're building a system from scratch (instead of, say, using ActiveRecord), I'd use the database. Otherwise, it's completely up to you. If you're not using Ruby, have a look at what libraries are out there for your language of choice and see what they can and can't do. – pat Aug 02 '09 at 20:30

score 1 · Answer 3 · answered Oct 17 '13 at 19:37


Has anyone used Apache Tika to index other types of documents, much like the Solr plugin does?



Wadester

