Read a PDF written in Chinese using Java

Question

I want to read a PDF file that is written in Chinese. I am currently using Apache PDFBox. When I try to read and print the PDF content, nothing is printed. Instead, I get the warning message below:

"org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode WARNING: No Unicode mapping for CID+10 (10) in font JCQDML+NotoSansCJKtc-Medium WARNING: No Unicode mapping for CID+40398 (40398) in font EYYJPZ+NotoSansCJKtc-DemiLight"

I understand that it is not able to find the font, so I have added the fontbox jar as a dependency as well. I still get the same message.

Can anyone help me with how to proceed on this?

EDITED: adding debugger output

No Unicode character is shown. Is this the problem? Please correct me if I am wrong.

If you use Apache PdfBox, why did you tag your question as an iText question? That's like bringing a cheap model of a Huawei phone to an Apple Store asking for free advice. — Bruno Lowagie, Aug 23 '17 at 06:25
If it can be done using iText, I can use that as well. I actually tried iText too and added the font-asian jar, but even that didn't give me correct output. That is why I tagged iText as well. Both are APIs for PDF. — ragini vyas, Aug 23 '17 at 06:53
Which version of iText did you use and why would you need font-asian.jar to extract text? That jar is ancient, and only needed to create PDFs, not to extract text. Maybe your PDF isn't created correctly. Do the fonts have a toUnicodeMap? (They should. If they haven't, you may never be able to extract text correctly.) — Bruno Lowagie, Aug 23 '17 at 07:05
Suppose that you put a broken DVD disk in a DVD player. If you can't play that disk, would you blame the DVD or the DVD player? The smart answer is: the DVD is broken, not the DVD player. In your case, you put a PDF without Unicode mapping into a tool to extract text. That tool tells you: "Hey, this PDF doesn't have a Unicode mapping!" But instead of saying "OK, my PDF is bad", you say: "hey, my PDF tools are bad." Does that make sense to you? — Bruno Lowagie, Aug 23 '17 at 07:10
Thank you @BrunoLowagie. I am new to this and just wanted some guidance. Let me try what you have suggested. Thanks. — ragini vyas, Aug 23 '17 at 07:19
@raginivyas Please read this: https://pdfbox.apache.org/2.0/faq.html#notext — Tilman Hausherr, Aug 23 '17 at 07:36
I removed the iText tag because the question is not about iText. If you want to ask a similar question, but about iText, then create a new question. — Amedee Van Gasse, Aug 23 '17 at 07:57
@raginivyas see also https://stackoverflow.com/a/15566820/1729265 and https://stackoverflow.com/a/30804279/1729265 . Extreme measures: https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0 — Tilman Hausherr, Aug 23 '17 at 08:11
@BrunoLowagie *"why would you need font-asian.jar to extract text? That jar is ancient, and only needed to create PDFs, not to extract text"* - Recently I observed differently, cf. [this answer](https://stackoverflow.com/a/45801151/1729265): That jar is needed for extracting text from PDFs with Type0 fonts without **ToUnicode** but with **Identity-H** encoding and a **CIDSystemInfo** with one of the standardized ROS-triples. I have to admit, though, that the PDF from that question is the first one so far in which I've observed that need. — mkl, Aug 23 '17 at 13:04
I have edited the question and added the debugger output. I see that the Unicode character column is empty and the encoding is Identity-H. Could someone please tell me what encoding PDF documents should ideally have, and whether the missing Unicode characters are the only root cause of my not being able to read this file? — ragini vyas, Aug 23 '17 at 13:09
@raginivyas Please simply share the file in question for analysis. — mkl, Aug 23 '17 at 13:16
@mkl I have kept the file here: https://drive.google.com/file/d/0B6k7AYGPEth2djFMNVJ0dC1wLVU/view?usp=sharing — ragini vyas, Aug 24 '17 at 06:32
I cannot reproduce an issue with your file, I successfully could open and render the PDF. Please provide a [sscce](http://sscce.org/) and environment information (PDFBox version, Java version, ...) to allow reproducing the issue. — mkl, Aug 24 '17 at 09:01
I am using Java 1.8 and PDFBox 2.0.6 on Windows, with the code below: `String file1 = "PathToChinese_pdf"; PDFTextStripper pdfStripper = null; PDDocument pdDoc = null; PDFParser parser = new PDFParser(new RandomAccessFile(new File(file1),"r")); parser.parse(); pdfStripper = new PDFTextStripper(); pdfStripper.setStartPage(1); pdfStripper.setEndPage(5); String parsedText = pdfStripper.getText(pdDoc); System.out.println(parsedText);` — ragini vyas, Aug 24 '17 at 09:26
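In the snippet as posted, `pdDoc` is never assigned after `parser.parse()`, so `getText(pdDoc)` would be called with `null`. A minimal sketch of the same extraction for PDFBox 2.0.x, assuming the document is opened with `PDDocument.load` instead of a hand-built `PDFParser`, might look like this. The class name `ExtractTextSketch`, the helper `extract`, and the use of `args[0]` as the file path are illustrative assumptions, not part of the original code.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical class name; a sketch of the posted extraction code for PDFBox 2.0.x.
public class ExtractTextSketch {

    /** Extracts text from the given 1-based, inclusive page range of an open document. */
    public static String extract(PDDocument doc, int startPage, int endPage) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(startPage);
        stripper.setEndPage(endPage);
        return stripper.getText(doc);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a placeholder for the path to the PDF file.
        // try-with-resources closes the document when extraction is done.
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            System.out.println(extract(doc, 1, 5));
        }
    }
}
```

This only makes the extraction code itself well-formed. If the fonts in the PDF carry no usable Unicode mapping, the warnings and the missing text would be expected to remain.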
@raginivyas Text Extract works too. The file you linked to (28 pages) is not the file from the PDFDebugger output (12 pages). — Tilman Hausherr, Aug 24 '17 at 13:11
Please take this file: https://drive.google.com/open?id=0B6k7AYGPEth2cVp1cEhXbF96bGc — ragini vyas, Aug 25 '17 at 05:07
@raginivyas the second one does have the problem. The first one (good) was created by iText 5.5.9, the second one (bad) was created by "iPhone OS 10.2.1 Quartz PDFContext". If you open the files with PDFDebugger you can see that the good file has a ToUnicode stream at `Root/Pages/Kids/[0]/Kids/[0]/Resources/Font/F1/ToUnicode` while the bad file doesn't. SO SAD! — Tilman Hausherr, Aug 25 '17 at 10:31
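The manual PDFDebugger check described above can also be approximated in code. The following is a rough sketch, assuming PDFBox 2.0.x, that walks the font resources of each page and reports those whose font dictionary has no `ToUnicode` entry. The class name `ToUnicodeCheck` and the method `fontsWithoutToUnicode` are hypothetical, and the sketch looks only at page-level resources, not at fonts used inside form XObjects. A missing `ToUnicode` entry does not always mean extraction will fail, since other mappings can sometimes be used, so the result is a hint rather than a verdict.

```java
import java.io.File;
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

// Hypothetical helper; reports page-level fonts that lack a ToUnicode entry.
public class ToUnicodeCheck {

    /** Returns "resourceKey (fontName)" for every page font without a ToUnicode entry. */
    public static Set<String> fontsWithoutToUnicode(PDDocument doc) throws IOException {
        Set<String> missing = new LinkedHashSet<>();
        for (PDPage page : doc.getPages()) {
            PDResources resources = page.getResources();
            if (resources == null) {
                continue; // page has no resource dictionary
            }
            for (COSName key : resources.getFontNames()) {
                PDFont font = resources.getFont(key);
                // The font dictionary is checked directly for the /ToUnicode key.
                if (font != null && !font.getCOSObject().containsKey(COSName.TO_UNICODE)) {
                    missing.add(key.getName() + " (" + font.getName() + ")");
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a placeholder for the path to the PDF file.
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            for (String entry : fontsWithoutToUnicode(doc)) {
                System.out.println("No ToUnicode entry: " + entry);
            }
        }
    }
}
```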

