I have a website where users upload documents in .doc and .pdf formats. I use Sphinx for full-text search over my MySQL database. What is the best way to index these file formats with Sphinx?

buti-oxa
Jared Brown

3 Answers


The method I use for this is pdftotext and antiword. I run each uploaded file through the matching tool and store the extracted text of the PDFs and Word documents in the database. From there it's easy for Sphinx to index, like any other text column.
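As a rough illustration of that extraction step, here is a minimal Python sketch. It assumes the `pdftotext` command from poppler-utils and `antiword` are installed and on the PATH; the helper names `build_command` and `extract_text` are made up for this example, and Python is used only for illustration since the question does not name a server-side language.

```python
import subprocess
from pathlib import Path

# Command templates per file extension (assumed mapping for this sketch).
# Invoked this way, both tools write the extracted text to stdout.
EXTRACTORS = {
    ".pdf": lambda path: ["pdftotext", "-enc", "UTF-8", path, "-"],
    ".doc": lambda path: ["antiword", path],
}


def build_command(path):
    """Return the argv list for the extractor matching the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix](str(path))
    except KeyError:
        raise ValueError(f"no extractor configured for {suffix!r} files")


def extract_text(path):
    """Run the external extractor and return the document text."""
    result = subprocess.run(
        build_command(path), capture_output=True, check=True
    )
    # antiword's output encoding depends on its character mapping and
    # locale, so undecodable bytes are replaced rather than raising.
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string can then be written to a text column with whatever MySQL client the site already uses, so that the existing Sphinx SQL source picks it up on the next index run.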

mlissner

Unfortunately, Sphinx can't index those file types directly. You'll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
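For the XML route, Sphinx reads a stream in its xmlpipe2 format. The following Python sketch shows roughly what generating such a stream could look like; the field name `content`, the function name `xmlpipe2_docset`, and the sample documents are assumptions made for illustration.

```python
import sys
from xml.sax.saxutils import escape


def xmlpipe2_docset(documents):
    """Build an xmlpipe2 stream from (id, text) pairs.

    `documents` is any iterable of (positive integer id, extracted text).
    """
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="content"/>',  # one full-text field (assumed name)
        "</sphinx:schema>",
    ]
    for doc_id, text in documents:
        lines.append(f'<sphinx:document id="{int(doc_id)}">')
        # escape() handles &, < and >; text extracted from real files may
        # also contain control characters that need stripping first.
        lines.append(f"<content>{escape(text)}</content>")
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [(1, "Quarterly report: revenue & costs"), (2, "Meeting notes")]
    sys.stdout.write(xmlpipe2_docset(sample))
```

A source of type `xmlpipe2` in sphinx.conf would then point its `xmlpipe_command` at a script like this one, so the indexer reads the stream from the script's standard output.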

pat
  • Would you recommend one method over another? – Jared Brown Jul 31 '09 at 04:04
  • Depends on what server-side language you're using. If it's Ruby/Rails, as far as I know none of the libraries support XML out of the box, so you'd only go that way if you're building a system from scratch (instead of, say, using ActiveRecord). In that case I'd use the database. Otherwise, it's completely up to you. If you're not using Ruby, have a look at what libraries exist for your language of choice and see what they can and can't do. – pat Aug 02 '09 at 20:30

Has anyone used Apache Tika to extract text from other types of documents for indexing, much as the Solr plugin does?

Some links:

  1. pdftotext (sometimes written pdf2text) is in the poppler or poppler-utils package on Linux
  2. antiword: seems to handle only the old binary .doc format, not the newer .docx
Wadester