I have a website where users upload documents in .doc and .pdf format. I am using Sphinx to run full-text searches against my MySQL database. What is the best way to index these file formats with Sphinx?
3 Answers
9
I use pdftotext and antiword for this. Both dump the text content of the PDFs and Word documents, which I then store in the database. From there it's easy to index with Sphinx.
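A minimal sketch of that approach, in case it helps. It assumes the `pdftotext` (Xpdf/Poppler) and `antiword` command-line tools are installed and on the PATH, and that "pdf2text" above refers to `pdftotext`. The `documents` table, its columns, and the function names are hypothetical, and `sqlite3` is only a standard-library stand-in for the MySQL database in the question.

```python
import sqlite3
import subprocess
from pathlib import Path

# Assumed mapping: file extension -> command that prints plain text on stdout.
EXTRACTORS = {
    ".pdf": lambda p: ["pdftotext", "-enc", "UTF-8", p, "-"],  # "-" means stdout
    ".doc": lambda p: ["antiword", p],
}


def extraction_command(path):
    """Return the command line used to extract text from `path`."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return EXTRACTORS[suffix](str(path))


def extract_text(path):
    """Run the external extractor and return its output as text."""
    result = subprocess.run(extraction_command(path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def store_document(conn, path, text):
    """Insert or update the extracted text, keeping the row id stable."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, filename TEXT UNIQUE, body TEXT)"
    )
    name = str(path)
    cur = conn.execute("UPDATE documents SET body = ? WHERE filename = ?", (text, name))
    if cur.rowcount == 0:
        conn.execute("INSERT INTO documents (filename, body) VALUES (?, ?)", (name, text))
    conn.commit()
```

Calling `store_document(conn, path, extract_text(path))` for each upload would fill the table, and an ordinary SQL source in Sphinx could then index the `body` column. Updating in place rather than deleting and re-inserting keeps the document id stable, which matters because Sphinx identifies documents by id.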

mlissner
6
Unfortunately, Sphinx can't index those file types directly. You'll need to import the textual contents either into a database or into the XML format that Sphinx understands (xmlpipe2).
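For the XML route, here is a rough sketch of a script that emits an xmlpipe2 stream. It assumes the text has already been extracted from the files; the field names `filename` and `body` and the function names are made up for illustration.

```python
import re
import sys
from xml.sax.saxutils import escape

# Control characters that are not allowed in XML 1.0 (pdftotext, for example,
# emits form feeds between pages).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean(text):
    """Replace XML-invalid control characters with spaces, then escape &, <, >."""
    return escape(_INVALID_XML_CHARS.sub(" ", text))


def xmlpipe2_stream(documents):
    """Yield xmlpipe2 lines for an iterable of (doc_id, filename, body) tuples."""
    yield '<?xml version="1.0" encoding="utf-8"?>'
    yield "<sphinx:docset>"
    yield "<sphinx:schema>"
    yield '<sphinx:field name="filename"/>'  # hypothetical field names
    yield '<sphinx:field name="body"/>'
    yield "</sphinx:schema>"
    for doc_id, filename, body in documents:
        yield f'<sphinx:document id="{int(doc_id)}">'
        yield f"<filename>{clean(filename)}</filename>"
        yield f"<body>{clean(body)}</body>"
        yield "</sphinx:document>"
    yield "</sphinx:docset>"


if __name__ == "__main__":
    sample = [(1, "memo.doc", "Q3 results & outlook")]
    sys.stdout.write("\n".join(xmlpipe2_stream(sample)) + "\n")
```

A source of `type = xmlpipe2` in sphinx.conf would then point its `xmlpipe_command` at a script like this one. Document ids need to be unique positive integers.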

pat
- Would you recommend one method over another? – Jared Brown Jul 31 '09 at 04:04
- Depends what server-side language you're using. If it's Ruby/Rails, I know all the libraries don't support XML out of the box, unless you're building a system from scratch (instead of, say, using ActiveRecord). So I'd use the database. Otherwise, it's completely up to you. If you're not using Ruby, have a look at what libraries are out there for your language of choice and see what they can and can't do. – pat Aug 02 '09 at 20:30
1
Has anyone used Apache Tika to index other types of documents, much like the Solr plugin does?

Wadester