Text is not extracted from Sample.pdf when using pdftextstream-2.6.3.jar

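import java.io.File;
import com.snowtide.pdf.OutputTarget;
import com.snowtide.pdf.PDFTextStream;

// Pipe all text from the PDF into an in-memory buffer.
String filePath = "D:\\inbox\\temp\\Sample.pdf";
File document = new File(filePath);
StringBuffer pdfText = new StringBuffer(1024);
OutputTarget tgt = new OutputTarget(pdfText);
PDFTextStream stream = new PDFTextStream(document);
stream.pipe(tgt);
stream.close();
// pdfText should now hold the extracted text.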
UdayKiran Pulipati
  • That file contains correctly encoded text ("Audit Case #0035") so you must be doing something wrong. Merely stating "it does not work" *is not a question*. – Jongware Jan 07 '15 at 11:34
  • @Jongware It works for other PDF documents, but not for the attached PDF document. – UdayKiran Pulipati Jan 07 '15 at 11:36
  • Define "not working"! Do you get an error, no text, no result at all, doesn't your program start, does it say "cannot find this file" ...??? – Jongware Jan 07 '15 at 12:31
  • Your jar was released in May 2013. Have you checked whether there is an update available? – mkl Jan 07 '15 at 13:51
  • I just tested your code using the current library version 3.1.1, and it failed. Your code has been marked as deprecated, though, so I tested with the current sample; that also failed. iText, PDFBox, and PDFClown, on the other hand, all succeeded. – mkl Jan 07 '15 at 14:58
  • @mkl: Still in the dark :P What does "it failed" mean? Do you get an error message, or 'nothing', or not what you expected? The first text stream contains some binary data in an inline image (`BI..EI`) and unfortunately, at this time my own parser cannot handle this gracefully. Not sure if that is also the case with pdftextstream. – Jongware Jan 07 '15 at 15:39
  • @Jongware *What does "it failed" mean?* - it extracts merely 3 empty lines. Maybe the inline images indeed are tripping stones here. Unfortunately PDFxStream is not open source, so I could not debug into it. – mkl Jan 07 '15 at 15:48
  • @Jongware Download the [Sample.pdf](https://app.box.com/s/n5fhwjp1wtl8hqoi7nra) file and paste it under the `D:\\inbox\\temp\\` folder (D drive -> inbox folder -> temp folder). – UdayKiran Pulipati Jan 08 '15 at 06:39
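
The failure described in the comments (output consisting of nothing but empty lines) is easy to miss because no exception is thrown. A minimal sketch of a check that makes it explicit is shown here; `ExtractionCheck` and `isEffectivelyEmpty` are hypothetical names for illustration and are not part of PDFxStream or any other library.

```java
// Hypothetical helper, not part of any PDF library: reports whether
// extracted text contains only whitespace (spaces, tabs, line breaks).
final class ExtractionCheck {

    static boolean isEffectivelyEmpty(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                return false; // found at least one "real" character
            }
        }
        return true; // empty, or whitespace and line breaks only
    }

    public static void main(String[] args) {
        // Stand-ins for extraction results; a real program would pass
        // the buffer filled by its PDF library instead.
        System.out.println(isEffectivelyEmpty("\n\n\n"));
        System.out.println(isEffectivelyEmpty("Audit Case #0035"));
    }
}
```

Calling such a check on the extracted buffer would turn "it is not working" into a concrete symptom that can be reported: the extraction ran without error but produced whitespace only.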

1 Answer

Earlier today, we released PDFxStream v3.1.2. This is a bugfix release that includes a fix for the issue you encountered here.

In the future, please do get in touch with us directly if you have any difficulties, at help@snowtide.com; we do everything we can to support our customers and users.

cemerick
  • The OP never got to name his issue -- all he said was, "hey I dunno, fix this for me". Can you expand on what exactly it fixed, so we can point out this new version for other questions with a similar problem? – Jongware Jan 13 '15 at 14:04
  • Prior revisions of PDFxStream contained a set of errors in the bundled Adobe Glyph List mapping; it's rarely used, but the net result was that text extracts were effectively empty (linebreaks only, no "real" characters). – cemerick Jan 13 '15 at 18:48
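
To illustrate why a faulty glyph-name mapping can yield empty output, here is a rough sketch of glyph-name-to-Unicode decoding. It is not PDFxStream's implementation, which is closed source. The class `GlyphNameDemo`, its tiny hand-picked table, and the choice to silently drop unmapped glyphs are all assumptions made for illustration; only the individual name-to-code-point pairs follow the Adobe Glyph List.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: decodes a sequence of glyph names to text using a
// lookup table, the way a text extractor might when a font identifies
// its characters by glyph name.
final class GlyphNameDemo {

    // A tiny subset of Adobe Glyph List entries (name -> code point).
    static final Map<String, Integer> GLYPHS = new HashMap<>();
    static {
        GLYPHS.put("A", 0x0041);
        GLYPHS.put("C", 0x0043);
        GLYPHS.put("a", 0x0061);
        GLYPHS.put("d", 0x0064);
        GLYPHS.put("e", 0x0065);
        GLYPHS.put("i", 0x0069);
        GLYPHS.put("s", 0x0073);
        GLYPHS.put("t", 0x0074);
        GLYPHS.put("u", 0x0075);
        GLYPHS.put("space", 0x0020);
        GLYPHS.put("numbersign", 0x0023);
        GLYPHS.put("zero", 0x0030);
        GLYPHS.put("three", 0x0033);
        GLYPHS.put("five", 0x0035);
    }

    // Assumption: glyphs missing from the table are dropped without error.
    static String decode(String[] glyphNames, Map<String, Integer> table) {
        StringBuilder out = new StringBuilder();
        for (String name : glyphNames) {
            Integer codePoint = table.get(name);
            if (codePoint != null) {
                out.appendCodePoint(codePoint);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] names = {"A", "u", "d", "i", "t", "space", "C", "a", "s", "e",
                          "space", "numbersign", "zero", "zero", "three", "five"};
        // With the table above, every name resolves to a character.
        System.out.println("[" + decode(names, GLYPHS) + "]");
        // With a table that lacks these names, nothing survives.
        Map<String, Integer> broken = Collections.emptyMap();
        System.out.println("[" + decode(names, broken) + "]");
    }
}
```

Under these assumptions, a mapping that fails to resolve the names used by a document's fonts leaves only the line breaks inserted by layout analysis, which matches the "3 empty lines" symptom reported in the question's comments.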