PDFBox throws error while extracting text encoded with font DejaVu Sans Condensed

Question

    // try-with-resources closes the document and the writer even if extraction fails
    try (PDDocument document = PDDocument.load(file))
    {
        if( document.isEncrypted() )
        {
            document.setAllSecurityToBeRemoved(false);
        }
        PDFTextStripper stripper = new PDFTextStripper();
        //stripper.setSortByPosition( true );
        String text = stripper.getText(document);
        System.out.println(text);
        try (OutputStreamWriter writer =
                new OutputStreamWriter(new FileOutputStream("C:\\preface.txt"), StandardCharsets.UTF_8))
        {
            writer.write(text);
        }
    }

I am trying to extract text from a PDF file that uses the fonts DejaVu Sans Condensed and DejaVu Sans Condensed-Bold, but PDFBox logs the error below:

SEVERE: Could not read ToUnicode CMap in font DejaVuSansCondensed
java.io.IOException: Error: expected the end of a dictionary.
at org.apache.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:477)
at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:112)
at org.apache.pdfbox.pdmodel.font.CMapManager.parseCMap(CMapManager.java:75)
at org.apache.pdfbox.pdmodel.font.PDFont.readCMap(PDFont.java:197)
at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:137)
at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:176)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:83)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at Library.main(Library.java:32)
Jun 03, 2018 1:30:59 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font DejaVuSansCondensed are not implemented in PDFBox and will be ignored
Jun 03, 2018 1:30:59 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+98 (98) in font DejaVuSansCondensed
Jun 03, 2018 1:30:59 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+105 (105) in font DejaVuSansCondensed

I also found that the fonts in this particular set of PDF files have no usable Unicode mapping. How can I supply a Unicode mapping so that this program can extract the text?

P.S. I am new to PDFBox.

The stack trace suggests that the font in your file either has a broken or invalid **ToUnicode** map, or has one that differs from what PDFBox expects. For further analysis, please share a sample PDF. — mkl, Jun 03 '18 at 08:15
https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0/39644941 — Tilman Hausherr, Jun 03 '18 at 08:53
I can't post the PDF here because it is confidential. Can anyone tell me how to add the Unicode mapping to PDFBox? Do I add it to attributes.txt in PDFBox? — Praveen Kenny, Jun 03 '18 at 09:19
The link I posted tells just that. And yes, it is a huge pain. Btw make sure that you're using the latest version of PDFBox, which is 2.0.9. — Tilman Hausherr, Jun 04 '18 at 12:27
I am using the latest version of PDFBox, i.e. 2.0.9, and I can't open the current PDF file with the PDF debugger. I will try to work out a way around it. — Praveen Kenny, Jun 04 '18 at 18:27

score 0 · Answer 1 · answered Sep 10 '18 at 14:38

I could solve this problem by downgrading to PDFBox 2.0.2.

Vincent Bons

I doubt that this is a good solution. If you can share the PDF, please open an issue in PDFBox JIRA https://issues.apache.org/jira/browse/PDFBOX – Tilman Hausherr Sep 11 '18 at 10:33
Agreed. I should have posted it as a comment. Still, I think it is useful to know that this problem might have been introduced between versions 2.0.2 and 2.0.9. – Vincent Bons Sep 12 '18 at 11:18
We need the PDF file, or at least the CMap file, which can be extracted with PDFDebugger: in the page, go to Resources, Font, then the specific font, then "ToUnicode", right-click, choose "Save stream as", and submit that file. – Tilman Hausherr Sep 12 '18 at 12:18
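
Once the ToUnicode stream has been saved as described in the comment above, a quick look at which character codes it covers can help narrow things down before filing an issue. The following is only a rough sketch in plain Java, not part of PDFBox: the class name `ToUnicodeCoverage` and its methods are hypothetical. It assumes that the saved stream is a text CMap with `beginbfchar`/`endbfchar` and `beginbfrange`/`endbfrange` sections, that each `bfrange` entry sits on its own line, and that source codes are at most two bytes long. It scans loosely with regular expressions, so it does not validate the CMap the way PDFBox's parser does.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: lists which character codes
// a saved ToUnicode CMap covers.
final class ToUnicodeCoverage {

    private static final Pattern HEX = Pattern.compile("<([0-9A-Fa-f]+)>");
    private static final Pattern BFCHAR =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);
    private static final Pattern BFRANGE =
            Pattern.compile("beginbfrange(.*?)endbfrange", Pattern.DOTALL);

    /** Returns the hex tokens (without angle brackets) found in the text. */
    static List<String> hexTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = HEX.matcher(text);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    /** Collects every source code covered by bfchar and bfrange sections. */
    static TreeSet<Integer> mappedCodes(String cmapText) {
        TreeSet<Integer> codes = new TreeSet<>();

        // bfchar entries come in pairs: <sourceCode> <unicodeString>
        Matcher chars = BFCHAR.matcher(cmapText);
        while (chars.find()) {
            List<String> hex = hexTokens(chars.group(1));
            for (int i = 0; i + 1 < hex.size(); i += 2) {
                codes.add(Integer.parseInt(hex.get(i), 16));
            }
        }

        // bfrange entries: <low> <high> <unicodeString> or <low> <high> [ ... ]
        // Assumption: one entry per line, so the first two tokens are low and high.
        Matcher ranges = BFRANGE.matcher(cmapText);
        while (ranges.find()) {
            for (String line : ranges.group(1).split("\\R")) {
                List<String> hex = hexTokens(line);
                if (hex.size() >= 3) {
                    int low = Integer.parseInt(hex.get(0), 16);
                    int high = Integer.parseInt(hex.get(1), 16);
                    for (int code = low; code <= high; code++) {
                        codes.add(code);
                    }
                }
            }
        }
        return codes;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path of the stream saved from PDFDebugger (any file name)
        String cmap = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.ISO_8859_1);
        TreeSet<Integer> codes = mappedCodes(cmap);
        // 98 and 105 are the CIDs named in the warnings in the question
        for (int cid : new int[] {98, 105}) {
            System.out.println("CID " + cid
                    + (codes.contains(cid) ? " is mapped" : " is NOT mapped"));
        }
    }
}
```

Run it with the saved stream's path as the only argument. If the CIDs from the warnings do appear in the file, the mapping data is present and the parse failure is more likely caused by a syntax problem elsewhere in the CMap, which is worth mentioning in the JIRA issue.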
