How to solve the "No Unicode mapping" error from PDFBox?

Question

I have an existing PDF file that I would like to convert to an Excel file using a Python script. I am currently using PDFBox, but it reports multiple errors similar to the following:

org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
No Unicode mapping for CID+24 (24) in font DroidSansFallback

Can I substitute the DroidSansFallback font, or replace it with another font, using PDFBox or another Java/Python script? Please help.

It is extremely difficult to solve these, see https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0. The best option is to contact the creator of the document and ask for a version that permits proper text extraction. – Tilman Hausherr, Nov 13 '19 at 07:22
Thank you @TilmanHausherr :) Would it be possible to use OCR? – Shertilda, Nov 13 '19 at 08:27
@TilmanHausherr, sorry, I am new to this. Is this error caused by the creator of the document leaving out the ToUnicode CMap? – Shertilda, Nov 13 '19 at 08:37
Sure, you can OCR it. Try Tesseract. Apache Tika supports this. Yes, the creator is at fault. It may even have been intentional. – Tilman Hausherr, Nov 13 '19 at 08:48
@TilmanHausherr, thank you for your help :) I will try OCR and see if it works. By the way, other than Python/Java, do you know of any other programming language that can automate the conversion of PDF to Excel? – Shertilda, Nov 13 '19 at 09:06
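
The OCR route suggested in the comments could look roughly like the following. This is only a minimal sketch: it assumes the tesseract command-line tool is installed and on the PATH, and that each PDF page has already been rendered to an image file by some other tool, since Tesseract reads images rather than PDFs. The class name, method names and the file name in the usage note are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

final class TesseractOcrSketch {

    // Builds the command line: tesseract <image> stdout -l <language>
    // "stdout" tells Tesseract to print the recognised text instead of writing a file.
    static List<String> buildCommand(Path pageImage, String language) {
        return List.of("tesseract", pageImage.toString(), "stdout", "-l", language);
    }

    // Runs Tesseract on one page image and returns the recognised text.
    static String ocr(Path pageImage, String language) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pageImage, language))
                // Discard diagnostics so a full stderr pipe cannot block the process.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }
}
```

Calling TesseractOcrSketch.ocr(Paths.get("page-1.png"), "eng") would then return the text of that page image. OCR output can contain recognition errors, so it is worth spot-checking before loading it into a spreadsheet.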

score 0 · Answer 1 · answered Jul 31 '22 at 16:34

Print the PDF to a new file using Microsoft Print to PDF and use that file instead. It will take care of all fonts.

user19660570

This method will prevent any non-OCR text extraction because the "printed" pdf file is essentially an image. – T_01 Sep 03 '22 at 17:08

score 0 · Answer 2 · answered Jul 04 '23 at 16:53

I ran into something similar recently when parsing text from PDFs:

WARNING: No Unicode mapping for 112 (142) in font AEDNQJ+Palatino-BoldItalic+2

This caused certain characters (such as á) to be missing from the extracted text:

M_s alta que los cielos, m_s honda que la mar,

(I added underscores where the character á should have been in the text.)

The fix is to regenerate your PDF with all fonts embedded (for example, by saving it as PDF/A), so that all fonts are available at text extraction time.

Example:

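import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// ParsingException is assumed to be the application's own unchecked exception, not a PDFBox class.
public String parsePdf(InputStream pdfStream) {
    // PDDocument.load is the PDFBox 2.x API; try-with-resources closes the document.
    try (PDDocument pdfDoc = PDDocument.load(pdfStream)) {
        // Extract the text of every page in the document.
        PDFTextStripper textStripper = new PDFTextStripper();
        return textStripper.getText(pdfDoc);
    } catch (IOException e) {
        throw new ParsingException("Unable to load input pdf stream", e);
    }
}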

With the regenerated PDF, the same line is extracted correctly:

Más alta que los cielos, más honda que la mar,

You can convert an existing PDF to PDF/A using Acrobat or the Preview app in macOS.
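
To judge how much of a document is affected, before and after regenerating it, the warnings PDFBox logs can be tallied per font. Here is a minimal sketch, assuming the log lines have already been captured as strings. The class and method names are hypothetical, and the pattern is derived only from the two warning messages quoted on this page, so other PDFBox versions may word the message differently.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeMappingWarnings {

    // Matches both warning shapes quoted on this page, e.g.
    //   No Unicode mapping for CID+24 (24) in font DroidSansFallback
    //   No Unicode mapping for 112 (142) in font AEDNQJ+Palatino-BoldItalic+2
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for (\\S+) \\((\\d+)\\) in font (\\S+)");

    // Returns font name -> number of matching warnings, sorted by font name.
    static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher matcher = WARNING.matcher(line);
            if (matcher.find()) {
                // Group 3 is the font name at the end of the message.
                counts.merge(matcher.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Fonts with a high count are the ones to check first when regenerating the PDF. If the count for a font drops to zero after regeneration, text set in that font should extract cleanly.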
