Extracting Hebrew text from PDF using apache pdfbox does not return all characters

Question

The code below extracts Hebrew text from http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf without the Hebrew character "ן". All other text seems to be extracted fine. Any ideas?

import org.junit.Assert;
import org.junit.Test;

public class TestPDFUtil {
    @Test
    public void testHebrewPDF() throws Exception {
        String url = "http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf";
        String text = PDFUtil.readPDF(url);
        System.out.println(text);
        // Fails today: the final nun (ן) is missing from the extracted text
        Assert.assertTrue(text.contains("זיכרון עבודה"));
    }
}

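import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
// PDFBox 2.x package names
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFUtil {
    // Downloads the PDF and returns its extracted text, or null if it is encrypted
    public static String readPDF(String url) throws IOException {
        URL urlObj = new URL(url);
        // try-with-resources closes the stream and the document on every path,
        // including the encrypted case and extraction errors
        try (InputStream in = urlObj.openStream();
             PDDocument document = PDDocument.load(in)) {
            if (!document.isEncrypted()) {
                PDFTextStripper stripper = new PDFTextStripper();
                return stripper.getText(document).trim();
            }
            return null;
        }
    }
}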

Attached are screenshots that show the missing character. On the left is how the page http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf appears in Chrome. On the right is the result of PDF text extraction using the code above.

I'm able to reproduce the problem. Not sure why, but it seems ן is read as the 0x15 NAK character. Have you tried another library? — shmosel, May 10 '17 at 21:02
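To check the observation above on your own output, a minimal sketch along these lines may help. It counts control characters (other than tab and line breaks) in an extracted string, so a stray 0x15 stands out. The class name `ControlCharReport`, the method `countControlChars`, and the sample string are made up for illustration; in practice the input would be whatever `PDFUtil.readPDF(url)` returns.

```java
import java.util.Map;
import java.util.TreeMap;

public class ControlCharReport {
    // Counts control characters in the text, keyed by code point.
    // Tab, LF and CR are skipped because they are normal in extracted text.
    public static Map<Integer, Integer> countControlChars(String text) {
        Map<Integer, Integer> counts = new TreeMap<>();
        text.codePoints()
            .filter(cp -> Character.isISOControl(cp)
                    && cp != '\t' && cp != '\n' && cp != '\r')
            .forEach(cp -> counts.merge(cp, 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the extracted text: the Hebrew
        // phrase from the test with U+0015 where the final nun should be.
        String sample = "\u05D6\u05D9\u05DB\u05E8\u05D5\u0015 "
                + "\u05E2\u05D1\u05D5\u05D3\u05D4";
        countControlChars(sample).forEach((cp, n) ->
                System.out.printf("U+%04X x %d%n", cp, n));
    }
}
```

If the report shows only U+0015, that would support the idea that a single glyph is affected; several different control characters would suggest more glyphs lack a usable mapping.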
1) Which letter is it? Could you make a screenshot where you mark that letter? 2) Can you get the text with Adobe Reader? 3) https://pdfbox.apache.org/2.0/faq.html#gibberish 4) I had a look at the file with PDFDebugger; some characters ("nunfinal", "memfinal", "kaffinal", "tsadiffinal", "pefinal") don't have a Unicode mapping in the font. — Tilman Hausherr, May 10 '17 at 22:52
To stress @Tilman's item 2: I searched for the character `ן` in the PDF using Adobe Reader and it didn't find any occurrence. Thus, even though that character might be there by the *looks* of the document, no glyph in the document actually *is mapped to the Unicode value* of that character `ן`. — mkl, May 11 '17 at 08:50
The test indeed fails and this demonstrates the problem. It should succeed. I haven't tried any other library. Any recommendations? I am not sure I understand how searching works in PDF. I downloaded the PDF and searched for other texts within Adobe Reader - searching does not work. — Jacobs2000, May 12 '17 at 06:18
*"I am not sure I understand how searching works in PDF"* - Adobe Reader's search relies on the regular text extraction mechanics to have something to search in. If copy & paste or search does not work properly in Adobe Reader, text extraction usually won't either: the PDF is simply incomplete and lacks the information required for this task. You might want to look into OCR solutions. — mkl, May 12 '17 at 07:34
Maybe creating your own ToUnicode stream would help. However, this is a lot of work; it would make sense only if you have a lot of files from the same source and with the same error. https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0 — Tilman Hausherr, May 15 '17 at 12:28
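Short of building a ToUnicode stream, a cheaper stopgap might be to patch the extracted string afterwards. The sketch below is not a PDFBox feature. It assumes the earlier observation holds throughout the file, namely that ן always comes out as U+0015 and that U+0015 never stands for anything else. The class `GlyphPatch` and its table are hypothetical, and only the one mapping reported in the comments is filled in; any others would have to be found by inspecting the output.

```java
import java.util.Map;

public class GlyphPatch {
    // Assumption: the only known bad mapping is U+0015 -> final nun (U+05DF),
    // as reported in the comments. Add further entries only once verified.
    static final Map<Character, Character> KNOWN_FIXES =
            Map.of('\u0015', '\u05DF');

    // Returns a copy of the text with every known bad character replaced.
    public static String patch(String extracted) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            char fixed = KNOWN_FIXES.getOrDefault(c, c);
            sb.append(fixed);
        }
        return sb.toString();
    }
}
```

With this in place, the test would call `GlyphPatch.patch(PDFUtil.readPDF(url))` before asserting. This only works around the symptom for files that share the same faulty font; the PDF itself still lacks the mapping.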
