Unable to read unicode character in pdf using java

Question

I am trying to convert a PDF document that contains Tamil Unicode characters into a Word document while retaining all of the formatting. I cannot read the Unicode characters from the PDF: they appear as junk characters in Word. I am using the code below. Can someone please help?

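    // Assumes iText 5.x (com.itextpdf) and Apache POI (poi-ooxml) on the classpath.
    import java.io.FileOutputStream;
    import java.io.IOException;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
    import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFRun;

    public class PdfToWord { // class name added so the snippet compiles
        public static void main(String[] args) throws IOException {
            System.out.println("Document converted started");
            XWPFDocument doc = new XWPFDocument();
            String pdf = "D:\\sample1.pdf";
            PdfReader reader = new PdfReader(pdf);
            // InputStreamReader isr = new InputStreamReader(reader,"UTF8");
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                // Extract the text of page i
                TextExtractionStrategy strategy = parser.processContent(i,
                        new SimpleTextExtractionStrategy());
                String text = strategy.getResultantText();
                System.out.println(text);
                // Write the page text as one paragraph of the Word document
                XWPFParagraph p = doc.createParagraph();
                XWPFRun run = p.createRun();
                // run.setFontFamily(new Font("Arial"));
                run.setFontSize(14);
                run.setText(text);
                // run.addBreak(BreakType.PAGE);
            }
            FileOutputStream out = new FileOutputStream("D:\\tamildoc.docx");
            doc.write(out);
            out.close();
            reader.close();
            System.out.println("Document converted successfully");
        }
    }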

Is the content of the `String text` junk, too, or is it as expected? – mkl, Feb 05 '15 at 13:29
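One way to answer that is to inspect the extracted string directly rather than trusting how the console or Word renders it. The following is only a minimal sketch using the plain JDK (Java 8 or later). `TamilTextCheck`, `countTamil`, and `dumpCodePoints` are hypothetical names made up for illustration and are not part of iText or POI. It assumes you pass in the value of `text` from the loop in the question; the hard-coded sample string merely stands in for it.

```java
// Diagnostic sketch: TamilTextCheck is a hypothetical helper, not a library class.
final class TamilTextCheck {

    /** Counts the code points of text that lie in the Unicode Tamil block (U+0B80..U+0BFF). */
    static long countTamil(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.TAMIL)
                .count();
    }

    /** Renders every code point as U+XXXX so the result does not depend on console encoding. */
    static String dumpCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample input (the word "Tamil" in Tamil script); in the question's code
        // this would be the value returned by strategy.getResultantText().
        String text = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD";
        System.out.println(dumpCodePoints(text));
        System.out.println("Tamil code points: " + countTamil(text));
    }
}
```

If `countTamil` returns 0 for a page that visibly contains Tamil text, the string was most likely already wrong when it left the extraction step, so changing the Word-writing code alone would probably not help.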

Fabrizio Morello · Answer 1 · 2015-02-05T11:27:36.917

0

You can use the Apache PDFBox library (https://pdfbox.apache.org/download.cgi). Its PDFTextStripper component has a getText(PDDocument doc) method that returns the text content of the PDF file as a plain String.

Here is an example:

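    // Assumes PDFBox 1.8.x or 2.x; fileName is the path of the PDF to read.
    // Imports: java.io.File, org.apache.pdfbox.pdmodel.PDDocument, and
    // PDFTextStripper (org.apache.pdfbox.util in 1.8.x, org.apache.pdfbox.text in 2.x).
    PDDocument doc = PDDocument.load(new File(fileName));
    String content = new PDFTextStripper().getText(doc);
    doc.close();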

After that, you can write the extracted text to your Word document.

edited Feb 05 '15 at 11:27

answered Feb 05 '15 at 11:21


I used the above code, but I am not getting the result. – Saravanan s Feb 07 '15 at 14:14
Take a look at [How to get Unicode of the characters from PDF using java and PDFBox](http://stackoverflow.com/questions/12577092/how-to-get-unicode-of-the-characters-from-pdf-using-java-and-pdfbox?rq=1) – Fabrizio Morello Feb 08 '15 at 10:31
