2

Is there any free Java library for extracting text from PDF that is compatible with Google App Engine?

I've read about PDFJet, but it can't read PDF, can it?

Is there perhaps another way to extract text from PDF? I tried http://www.pdfdownload.org/, but unfortunately it doesn't handle non-English characters correctly.

Miroslav Bajtoš

5 Answers

3

iText now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.

Kevin Day
  • iText uses certain classes (like java.awt.AffineTransform) that are not available on GAE. See this page for more details: http://groups.google.com/group/google-appengine-java/web/will-it-play-in-app-engine – Miroslav Bajtoš Mar 29 '10 at 06:07
  • hmmm. The parser library certainly doesn't use AffineTransform (I actually implemented my own matrix transformations for the parser). I know that iText *supports* affine transforms when generating PDF files, but I doubt that it's required for parsing. Post the class and method that is giving you problems with using this with app engine and I'll take a look. – Kevin Day Mar 30 '10 at 03:06
  • I just used iText successfully under GAE environment =) – rsalmeidafl Mar 17 '11 at 20:33
  • In the end I managed to get iText running in GAE - at least for the documents that I am parsing. – Miroslav Bajtoš Mar 28 '11 at 17:28
  • I am curious about what problems you ran into using GAE, and how you addressed them - like I said, nothing in the parsing code should use AffineTransform... – Kevin Day Mar 31 '11 at 03:08
2

I modified the latest (1.8.0-Snapshot) version of PDFBox to run on Google App Engine. I had to disable one unit test, but it runs fine for simple text extraction.

Following a simple try-fail-fix approach, I had to modify 5 files in total. Pretty doable.

You'll also have to explicitly use a RandomAccessBuffer, like Fabrizio explained.

For the extra lazy, here's the compiled jar, the dependencies for text extraction, and the patch. Note that it might not work for every use case (e.g. rectangle-based extraction). I used it to extract the text of a whole page. https://docs.google.com/folder/d/0B53n_gP2oU6iVjhOOVBNZHk0a0E/edit

icyerasor
  • I am following [this answer](http://stackoverflow.com/questions/4955635/how-to-add-local-jar-files-in-maven-project/36602256#36602256) (by Anthony O.) to add the jar files to my project. Should all jar files (including the dependencies) be added to the same directory? – dina Feb 07 '17 at 12:46
  • Yes. Or better yet, get "recent" versions of commons logging and fontbox as dependencies through the regular Maven POM and see if it works. – icyerasor Feb 07 '17 at 17:20
  • thanks!! was very useful!! finally I can parse pdf on GAE :) – dina Feb 08 '17 at 11:27
2

PdfBox does not run on GAE out of the box. It uses Java classes that are not allowed there.
(GAE only permits the classes listed at http://code.google.com/appengine/docs/java/jrewhitelist.html)

I have partially modified a very old version of PdfBox (0.7.3) to be GAE compliant. Now I'm able to extract text from PDF (whole page or rectangular area). I only modified the minimum needed for PDF text extraction, not the whole of PdfBox. :)
The idea was to remove references to java.awt.Rectangle and similar classes, using my own "rectangle" class instead.
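
To illustrate the rectangle replacement, here is a minimal sketch of what such a class could look like. This is not the actual patch: `SimpleRect`, `TextChunk` and `RegionTextFilter` are hypothetical names made up for this example, and the sketch assumes the extractor reports each text fragment together with its x/y position. It uses only plain Java, so nothing from java.awt is needed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for java.awt.Rectangle: plain fields, no AWT dependency. */
final class SimpleRect {
    final float x, y, width, height;

    SimpleRect(float x, float y, float width, float height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** True if the point is inside; left/top edges inclusive, right/bottom exclusive. */
    boolean contains(float px, float py) {
        return px >= x && px < x + width && py >= y && py < y + height;
    }
}

/** Hypothetical positioned text fragment, as a text extractor might report it. */
final class TextChunk {
    final String text;
    final float x, y;

    TextChunk(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class RegionTextFilter {
    /** Joins, in input order, the chunks whose position falls inside the region. */
    static String textInRegion(List<TextChunk> chunks, SimpleRect region) {
        StringBuilder sb = new StringBuilder();
        for (TextChunk c : chunks) {
            if (region.contains(c.x, c.y)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<TextChunk> chunks = new ArrayList<>();
        chunks.add(new TextChunk("Invoice", 50, 40));
        chunks.add(new TextChunk("No. 17", 120, 40));
        chunks.add(new TextChunk("Footer", 50, 780));

        SimpleRect header = new SimpleRect(0, 0, 600, 100);
        System.out.println(textInRegion(chunks, header)); // prints "Invoice No. 17"
    }
}
```

The real modification has to touch every place where the text stripper refers to the AWT class, so treat this only as an outline of the approach.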

More info: http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html

Fabrizio Accatino
1

I know there is http://pdfbox.apache.org/index.html

Apache PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents.

but I've never tested it.

Pierre
-1

Last month I finished extracting text from PDF files in my project. I used the Xpdf tool to get the text and the text coordinates, but I used it from Xcode (Objective-C). The tool is open source, written in C++, and can be used from many languages. However, I don't know whether Xpdf would work with your Java setup. Anyway, you can try it.