How to extract text from a PDF using PDFTextStream in Java

Question

No text is extracted from the Sample.pdf file when I use pdftextstream-2.6.3.jar with the following code:

import java.io.File;
import com.snowtide.pdf.OutputTarget;
import com.snowtide.pdf.PDFTextStream;

String filePath = "D:\\inbox\\temp\\Sample.pdf";
File document = new File(filePath);
StringBuffer pdfText = new StringBuffer(1024);  // receives the extracted text
OutputTarget tgt = new OutputTarget(pdfText);
PDFTextStream stream = new PDFTextStream(document);
stream.pipe(tgt);   // send the document's text to pdfText
stream.close();

That file contains correctly encoded text ("Audit Case #0035"), so you must be doing something wrong. Merely stating "it does not work" *is not a question*. – Jongware, Jan 07 '15 at 11:34
@Jongware It works for other PDF documents, but not for the attached PDF document. – UdayKiran Pulipati, Jan 07 '15 at 11:36
Define "not working"! Do you get an error, no text, no result at all, doesn't your program start, does it say "cannot find this file" ...? – Jongware, Jan 07 '15 at 12:31
Your jar was released in May 2013. Have you checked whether there is an update available? – mkl, Jan 07 '15 at 13:51
I just tested your code using the current library version 3.1.1; it failed. Your code has been marked as deprecated, though, so I tested with the current sample; that also failed. iText, PDFBox, and PDFClown, on the other hand, all succeeded. – mkl, Jan 07 '15 at 14:58
@mkl: Still in the dark :P What does "it failed" mean? Do you get an error message, or 'nothing', or not what you expected? The first text stream contains some binary data in an inline image (`BI..EI`) and unfortunately, at this time my own parser cannot handle this gracefully. Not sure whether pdftextstream has the same problem. – Jongware, Jan 07 '15 at 15:39
@Jongware *What does "it failed" mean?* - it extracts merely 3 empty lines. Maybe the inline images are indeed stumbling blocks here. Unfortunately PDFxStream is not open source, so I could not debug into it. – mkl, Jan 07 '15 at 15:48
@Jongware Download the [Sample.pdf](https://app.box.com/s/n5fhwjp1wtl8hqoi7nra) file and paste it into the `D:\\inbox\\temp\\` folder (D drive -> inbox folder -> temp folder). – UdayKiran Pulipati, Jan 08 '15 at 06:39
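
To pin down what "not working" means in a case like this, it can help to inspect the buffer after extraction and report whether it is empty, contains only line breaks, or contains real text. The following is a minimal sketch using only the standard library; `ExtractionCheck` and `classify` are hypothetical names, and the strings in `main` stand in for whatever ended up in `pdfText`.

```java
class ExtractionCheck {
    enum Result { NO_OUTPUT, WHITESPACE_ONLY, HAS_TEXT }

    // Classifies extracted text so the symptom can be reported precisely.
    static Result classify(CharSequence extracted) {
        if (extracted == null || extracted.length() == 0) {
            return Result.NO_OUTPUT;
        }
        for (int i = 0; i < extracted.length(); i++) {
            if (!Character.isWhitespace(extracted.charAt(i))) {
                return Result.HAS_TEXT;
            }
        }
        return Result.WHITESPACE_ONLY;
    }

    public static void main(String[] args) {
        // Stand-ins for the contents of the extraction buffer.
        System.out.println(classify("\n\n\n"));            // prints WHITESPACE_ONLY
        System.out.println(classify("Audit Case #0035"));  // prints HAS_TEXT
    }
}
```

A result of `WHITESPACE_ONLY` would correspond to the "3 empty lines" symptom described in the comments, as opposed to an exception or a missing file.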

cemerick · Accepted Answer · 2015-01-13T13:40:35.380

Earlier today, we released PDFxStream v3.1.2. This is a bugfix release that includes a fix for the issue you encountered here.

In the future, please do get in touch with us directly if you have any difficulties, at help@snowtide.com; we do everything we can to support our customers and users.

edited Jan 13 '15 at 13:40

answered Jan 13 '15 at 13:34


The OP never got to name his issue; all he said was, "hey I dunno, fix this for me". Can you expand on what exactly it fixed, so we can point to this new version in other questions with a similar problem? – Jongware Jan 13 '15 at 14:04

Prior revisions of PDFxStream contained a set of errors in the bundled Adobe Glyph List mapping; it's rarely used, but the net result was that text extracts were effectively empty (linebreaks only, no "real" characters). – cemerick Jan 13 '15 at 18:48
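
To illustrate how a faulty glyph list can lead to output with line breaks but no characters: a PDF font may identify its glyphs by name, and a text extractor can consult the Adobe Glyph List to turn those names into Unicode. The sketch below is not PDFxStream code (the library is closed source). It is a simplified, hypothetical illustration that uses a six-entry excerpt of the list and handles only the plain `uniXXXX` naming form; `GlyphNameSketch`, `AGL_EXCERPT`, `toText` and `decode` are made-up names.

```java
import java.util.List;
import java.util.Map;

class GlyphNameSketch {
    // Six entries from the Adobe Glyph List; the real list is far longer.
    static final Map<String, Integer> AGL_EXCERPT = Map.of(
            "space", 0x0020,
            "numbersign", 0x0023,
            "zero", 0x0030,
            "three", 0x0033,
            "five", 0x0035,
            "A", 0x0041);

    // Maps one glyph name to text, or to "" when the name cannot be resolved.
    static String toText(String glyphName, Map<String, Integer> glyphList) {
        Integer codePoint = glyphList.get(glyphName);
        if (codePoint == null && glyphName.matches("uni[0-9A-F]{4}")) {
            // "uniXXXX" names carry the code point as four uppercase hex digits.
            codePoint = Integer.parseInt(glyphName.substring(3), 16);
        }
        return codePoint == null ? "" : new String(Character.toChars(codePoint));
    }

    // Concatenates the text for a sequence of glyph names.
    static String decode(List<String> glyphNames, Map<String, Integer> glyphList) {
        StringBuilder out = new StringBuilder();
        for (String name : glyphNames) {
            out.append(toText(name, glyphList));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> names = List.of("numbersign", "zero", "zero", "three", "five");
        System.out.println("[" + decode(names, AGL_EXCERPT) + "]"); // prints [#0035]
        System.out.println("[" + decode(names, Map.of()) + "]");    // prints []
    }
}
```

With the intact excerpt the names resolve to `#0035`; with a table that lacks the entries, every lookup yields an empty string, so only layout characters such as line breaks would remain in the extracted text.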
