Squares/dots/blank as output while copying and pasting the text manually from PDF

Question

I am trying to extract text from a PDF. I first tried PDFBox. In the output, some parts of the text were missing, and the Eclipse console showed the following warning:

No Unicode mapping for CID+49 (49) in font Helvetica

What does this warning mean? I searched for an explanation, but it is still not clear to me. A clear explanation would be very helpful.

For the same PDF, I get squares or dots when I manually copy and paste the text from the PDF. Why does this happen?

See the comments here: https://stackoverflow.com/questions/39324398/issue-with-reading-some-unicode-characters-out-of-a-pdf-using-pdfbox – Tilman Hausherr Sep 19 '16 at 10:11
Essentially, the information in the PDF about the font in question is too deficient for text extraction, unless the extraction is based on OCR. – mkl Sep 19 '16 at 10:37
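
To illustrate what "no Unicode mapping" means, here is a minimal conceptual sketch in plain Java. It is not PDFBox code: the class `ToUnicodeSketch`, the method names, and the sample mapping are all hypothetical. The idea is that a PDF stores character codes, and an extractor needs a code-to-Unicode mapping supplied by the font to turn them back into text. The sketch assumes unmapped codes are simply skipped; a real extractor or viewer may instead substitute a placeholder, which would be consistent with the squares or dots seen when copying and pasting.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch only: this is NOT PDFBox code. It mimics how an extractor
// turns the character codes stored in a PDF into Unicode text.
class ToUnicodeSketch {

    // Hypothetical stand-in for the code-to-Unicode mapping a PDF font supplies.
    static Map<Integer, String> sampleMapping() {
        Map<Integer, String> map = new HashMap<>();
        map.put(72, "H");
        map.put(105, "i");
        // Deliberately no entry for code 49: the glyph can still be drawn on
        // screen, but nothing says which character it represents.
        return map;
    }

    // Converts codes to text. Codes without a mapping are reported and skipped
    // (an assumption of this sketch), so that part of the text goes missing.
    static String extract(int[] codes, Map<Integer, String> toUnicode, String fontName) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                System.err.println("No Unicode mapping for code " + code + " in font " + fontName);
                continue;
            }
            text.append(unicode);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int[] codes = {72, 105, 49};
        // Prints "Hi" on stdout; the character behind code 49 is lost.
        System.out.println(extract(codes, sampleMapping(), "Helvetica"));
    }
}
```

The page still renders correctly because drawing a glyph only needs the code and the font's glyph data. Extraction needs the extra code-to-character information, and when the PDF does not carry it, only OCR on the rendered page can recover the text.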

score 0 · Answer 1 · answered Sep 19 '16 at 09:36

You can try the org.apache.pdfbox.text.PDFTextStripper class, which has a method that returns all the text available in your PDF document. The String getText(PDDocument doc) method can help you greatly. See the PDFTextStripper API documentation for details. Hope it helps.

Joseph Peter


Yes, I tried that. It is exactly the approach that left some parts of the text missing from the final output. – sagar Sep 19 '16 at 09:47
This answer isn't helpful; he obviously used PDFTextStripper already. – Tilman Hausherr Sep 19 '16 at 10:12
Maybe you should try going deeper into the API for something that can help you better. – Joseph Peter Sep 19 '16 at 16:54
