
Possible Duplicate:
How to extract text from the PDF document?

Problem / Application: I am building a system in PHP/Java on a Windows 2008 Server running Apache. The concept is that a user uploads a PDF file. I then want the system to analyze the uploaded PDF file and generate a title/description using an algorithm I am going to design. Later, my search engine will search the stored titles/descriptions to find PDFs relevant to the query. This will let me search the stored PDF files without opening the PDFs themselves during the search.

What I need is a script or code that converts the PDF to text and stores it in an array (or something similar) that I can then break down to get what I need.

I've found other threads that use Unix/Linux command-line techniques, but I haven't found any scripts that do what I need for an Apache server on Windows.

Any suggestions or alternative techniques I could use for this would be greatly appreciated!

Vidarious

1 Answer


Conversion of PDF files to plain text is problematic because of the way text is represented within them (as drawing instructions on a two-dimensional surface), especially when the source has multiple columns.
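To make the multi-column problem concrete, here is a minimal Java sketch. It is not taken from any PDF library: the `Fragment` class, the sample coordinates and the column boundary of 200 are all made up for illustration, and `y` is measured downward from the top of the page for simplicity (real PDF coordinates have their origin at the bottom-left). It shows how reading positioned fragments top to bottom and left to right interleaves two columns, while knowing the column boundary restores the reading order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class ColumnOrderDemo {
    // Hypothetical positioned text fragment: a string drawn at (x, y).
    // y grows downward here, which is a simplification for the demo.
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Made-up two-column page: left column at x=50, right column at x=300.
    static List<Fragment> samplePage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(300, 100, "Dogs are"));
        page.add(new Fragment(50, 100, "Cats are"));
        page.add(new Fragment(300, 120, "loyal pets."));
        page.add(new Fragment(50, 120, "small pets."));
        return page;
    }

    // Naive order: top to bottom, then left to right. Columns get interleaved.
    static String naiveOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return join(sorted);
    }

    // Column-aware order: everything left of the boundary first, then the rest.
    // The boundary must be supplied by the caller; finding it is the hard part.
    static String columnOrder(List<Fragment> fragments, double columnBoundary) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.x < columnBoundary ? 0 : 1)
                .thenComparingDouble(f -> f.y)
                .thenComparingDouble(f -> f.x));
        return join(sorted);
    }

    private static String join(List<Fragment> fragments) {
        return fragments.stream().map(f -> f.text).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<Fragment> page = samplePage();
        System.out.println(naiveOrder(page));
        System.out.println(columnOrder(page, 200));
    }
}
```

With the sample data, the naive order yields "Cats are Dogs are small pets. loyal pets." while the column-aware order yields "Cats are small pets. Dogs are loyal pets." A real converter has to guess the column boundary from the layout, which is one reason results vary from document to document.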

There are a number of tools you can use, both open source and proprietary, but having looked at all of them, I can confidently state that none works in every case. A Google search for "PDF to text conversion" will show you most of them.
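Whichever tool you choose, its output is plain text, so the "store it in an array and break it down" step from the question can be written independently of the converter. The following is only a sketch under assumptions: the class and method names are hypothetical, and the heuristic (first non-empty line as the title, the following lines as a truncated description) is a placeholder for whatever algorithm you design, not a recommendation.

```java
import java.util.ArrayList;
import java.util.List;

class ExtractedTextSummary {
    // Split converter output into trimmed, non-empty lines.
    // "\\R" matches any line break, including Windows "\r\n".
    static List<String> toLines(String extractedText) {
        List<String> lines = new ArrayList<>();
        for (String raw : extractedText.split("\\R")) {
            String line = raw.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Placeholder heuristic (an assumption): the first non-empty line is the title.
    static String title(List<String> lines) {
        return lines.isEmpty() ? "" : lines.get(0);
    }

    // Placeholder heuristic (an assumption): join the lines after the title
    // and cut the result off at maxChars characters.
    static String description(List<String> lines, int maxChars) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < lines.size(); i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(lines.get(i));
            if (sb.length() >= maxChars) {
                break;
            }
        }
        return sb.length() > maxChars ? sb.substring(0, maxChars) : sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for text produced by a PDF-to-text tool.
        String text = "  Annual Report 2011 \r\n\r\nRevenue grew.\nCosts fell.\n";
        List<String> lines = toLines(text);
        System.out.println(title(lines));
        System.out.println(description(lines, 200));
    }
}
```

The title and description strings can then be stored alongside the file so that searches never need to open the PDFs themselves.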

You may also wish to explore a text search engine with PDF conversion built in, such as Solr or Elasticsearch, both of which are open source and based on Apache Lucene. Again, a Google search for either will point you to its homepage.

Rob Raisch