I am trying to extract text from a PDF. I first tried PDFBox. In the output, some parts of the text were missing, and the Eclipse console showed the following warning:

No Unicode mapping for CID+49 (49) in font Helvetica

What does this warning mean? I searched for an explanation, but it is still not clear to me. A clear explanation would be very helpful.

For the same PDF, I get squares or dots when I copy and paste the text from the PDF manually. Why does this happen?

sagar
  • see the comments here https://stackoverflow.com/questions/39324398/issue-with-reading-some-unicode-characters-out-of-a-pdf-using-pdfbox – Tilman Hausherr Sep 19 '16 at 10:11
  • Essentially, the information in the PDF concerning the font in question is too deficient for text extraction unless it is based on OCR. – mkl Sep 19 '16 at 10:37
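
To make the comments above concrete, here is a toy model of what the warning describes. It is not PDFBox code: the class `ToUnicodeSketch`, its methods, and the sample table are hypothetical names made up for illustration. The assumption is that a PDF stores text as character codes and that an extractor needs a code-to-Unicode table from the font (for example a ToUnicode CMap or a usable encoding) to turn them into characters. When a code such as 49 has no entry, the glyph can still be drawn on screen, but there is nothing meaningful to extract or copy, which would explain both the missing text and the squares when pasting.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, not part of PDFBox.
class ToUnicodeSketch {

    // Hypothetical stand-in for a font's code-to-Unicode table.
    // Code 49 is deliberately left out to mimic the deficient font.
    public static Map<Integer, String> sampleMapping() {
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(72, "H");
        toUnicode.put(105, "i");
        return toUnicode;
    }

    // Converts character codes to text; codes without a mapping are
    // skipped and reported, similar in spirit to the warning in the question.
    public static String extract(int[] codes, Map<Integer, String> toUnicode,
                                 String fontName, List<String> warnings) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                warnings.add("No Unicode mapping for CID+" + code
                        + " (" + code + ") in font " + fontName);
                continue; // nothing to append: this is the "missing text"
            }
            text.append(unicode);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        String text = extract(new int[] {72, 105, 49}, sampleMapping(),
                "Helvetica", warnings);
        System.out.println("Extracted: " + text);
        warnings.forEach(System.out::println);
    }
}
```

In this sketch the third code is dropped from the result and only a warning is recorded. If the font in the real PDF is deficient in the same way, switching extraction libraries is unlikely to recover the text, which is why OCR is suggested.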

1 Answer

You can try the org.apache.pdfbox.text.PDFTextStripper class, which has a method that returns all the text available in your PDF document. The String getText(PDDocument doc) method should help here. See the PDFTextStripper API documentation for details. Hope it helps.

Joseph Peter