
I have an existing PDF file that I would like to convert to an Excel file using a Python script. I am currently using PDFBox, but I get multiple errors similar to the following:

org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
No Unicode mapping for CID+24 (24) in font DroidSansFallback

Can I substitute the DroidSansFallback font, or replace it with another font, using PDFBox or another Java/Python script? Please help.

Shertilda
  • 2
    It is extremely difficult to solve these, see https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0 . The best is to contact the creator of the document to bring up a document that permits proper text extraction. – Tilman Hausherr Nov 13 '19 at 07:22
  • 1
    Thank you @TilmanHausherr :) would it be possible to use OCR? – Shertilda Nov 13 '19 at 08:27
  • 1
    @TilmanHausherr, sorry, I am new to this. Is this error caused by the creator of the document leaving out the ToUnicode CMap? – Shertilda Nov 13 '19 at 08:37
  • 1
    Sure you can OCR it. Try Tesseract. Apache Tika supports this. Yes the creator is at fault. It may even have been intended. – Tilman Hausherr Nov 13 '19 at 08:48
  • 1
    @TilmanHausherr, thank you for your help :) Will try OCR if it works. Btw other than python/java, any idea of other programming language that is able to automate the conversion of pdf to excel? – Shertilda Nov 13 '19 at 09:06
  • 1
    Sorry, I don't know. I would have to google too :-) – Tilman Hausherr Nov 13 '19 at 09:17
  • Okie, no worries :) Thank you @TilmanHausherr – Shertilda Nov 13 '19 at 09:24
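
The OCR route suggested in the comments could be sketched roughly as follows. This assumes the PDF pages have already been rendered to image files and that the `tesseract` command-line tool is installed and on the PATH. The class name `TesseractOcr`, its methods, and the file names are hypothetical; the command follows Tesseract's `tesseract imagename outputbase -l lang` form, which writes the recognized text to `outputbase.txt`.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: runs the tesseract CLI on one already-rendered page image.
class TesseractOcr {

    // Builds the argument list: tesseract <image> <outputBase> -l <language>
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Starts tesseract, forwards its console output, and returns its exit code.
    static int run(Path image, Path outputBase, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Assumed file names; on success tesseract writes page-1.txt next to the image.
        int exitCode = run(Path.of("page-1.png"), Path.of("page-1"), "eng");
        System.out.println("tesseract exited with " + exitCode);
    }
}
```

Since OCR works from pixels rather than from the font's character mappings, the missing ToUnicode entries should not matter on this route, though recognition quality will depend on the rendering resolution and the language data installed.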

2 Answers

0

Print the PDF to a new file using the "Microsoft Print to PDF" printer and use that file instead. That should take care of the fonts.

  • 1
    This method will prevent any non-OCR text extraction because the "printed" pdf file is essentially an image. – T_01 Sep 03 '22 at 17:08
0

I ran into something similar lately when parsing text from PDFs:

WARNING: No Unicode mapping for 112 (142) in font AEDNQJ+Palatino-BoldItalic+2

This caused certain characters (such as á) to be missing from the extracted text:

M_s alta que los cielos, m_s honda que la mar,

(Added underscores where the character <á> should have been in the text)
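
To see which fonts are affected before regenerating anything, the warnings can be tallied per font. This is a minimal sketch, assuming the PDFBox log output has been captured as plain text lines. The class `UnicodeWarningSummary` and its method `countByFont` are made up for illustration and are not part of PDFBox, and the regular expression is based only on the two warning formats quoted in this thread.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: counts "No Unicode mapping" warnings per font.
class UnicodeWarningSummary {

    // Matches both formats seen in this thread, for example:
    //   No Unicode mapping for CID+24 (24) in font DroidSansFallback
    //   WARNING: No Unicode mapping for 112 (142) in font AEDNQJ+Palatino-BoldItalic+2
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for \\S+ \\(\\d+\\) in font (\\S+)");

    // Returns font name -> number of matching warning lines, sorted by font name.
    static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher matcher = WARNING.matcher(line);
            if (matcher.find()) {
                counts.merge(matcher.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode",
                "No Unicode mapping for CID+24 (24) in font DroidSansFallback",
                "WARNING: No Unicode mapping for 112 (142) in font AEDNQJ+Palatino-BoldItalic+2");
        countByFont(log).forEach((font, count) -> System.out.println(font + ": " + count));
    }
}
```

A font that shows up many times in the summary is a likely candidate for the characters that go missing in the output.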

The fix is to regenerate your PDF with all fonts embedded (such as PDF/A), so that all fonts are available at text extraction time.

Example:

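import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Extracts all text from a PDF stream (PDFBox 2.x API).
public String parsePdf(InputStream pdfStream) {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument pdfDoc = PDDocument.load(pdfStream)) {
        PDFTextStripper textStripper = new PDFTextStripper();
        return textStripper.getText(pdfDoc);
    } catch (IOException e) {
        // ParsingException is the application's own exception type, not part of PDFBox
        throw new ParsingException("Unable to load input pdf stream", e);
    }
}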

With the regenerated PDF, the same line is extracted correctly:

Más alta que los cielos, más honda que la mar,

You can convert an existing PDF to PDF/A using Acrobat or the Preview app in macOS.