Unable to print Non English (Latvian) Characters from pdf file correctly in Java using PDFBox?

Question

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Sample {
    public static void main(String[] args) throws IOException {
        File file = new File("C:\\sample.pdf");
        // try-with-resources closes the document once the text is extracted
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            // Attempted fix with an explicit Latvian code page:
            //java.io.PrintStream p = new java.io.PrintStream(System.out,false,"Cp921");
            //p.println(text.toString());
            System.out.println(text);
        }
    }
}

The text is read from the PDF, but when I display it with System.out.println the output is different. I read several posts online and found that this has something to do with encoding. I found a solution in this question: "Text extracted by PDFBox does not contain international (non-English) characters". I had to use the Cp921 encoding for Latvian characters, but the problem is still not solved, as the console output below shows.

Then I debugged the program and found that the text read from the PDF is stored correctly, without any changes, so I don't know how to display it with the correct encoding. Any help would be great. Thanks in advance.

Sample PDF content: [Maksātājs, Informācija, Vārdu krājums, Ēģipte, Plašs, Vājš, Brieži, Pērtiķi, Grāmatiņa, šķīvis]

Console output in Eclipse using System.out.println:


Console output in eclipse using PrintStream:


P.S. I am a beginner programmer and do not have much coding experience.

geco17 · Accepted Answer · 2018-06-02T20:36:57.117

You can change the encoding used by System.out either by setting the file.encoding system property or by replacing the stream itself. Try one of the following:

1. -Dfile.encoding=utf-8 (or whatever you need) as a JVM argument
2. System.setProperty("file.encoding", "utf-8") -- meant to do the same as (1) at runtime, but the JVM normally reads this property once at startup, so setting it later may have no effect
3. System.setOut(new PrintStream(System.out, true, "utf-8")) -- set System.out to whatever print stream you need
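A minimal sketch of the System.setOut approach, assuming the console itself (for example the Eclipse Run configuration) is set to display UTF-8. The class name Utf8Console and the helper utf8Stream are illustrative only, and the hard-coded sample words stand in for the text returned by stripper.getText(document). The words are written with Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {

    // Hypothetical helper: wraps any stream in an auto-flushing PrintStream that encodes as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Replace System.out before printing any extracted text.
        System.setOut(utf8Stream(System.out));

        // "Maksātājs, Ēģipte" from the question, written with Unicode escapes.
        String text = "Maks\u0101t\u0101js, \u0112\u0123ipte";
        System.out.println(text);
    }
}
```

If the console is not configured for UTF-8, the bytes are still written correctly but may be displayed as the wrong characters.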

EDIT

Your comment mentions you're writing to a file. To write to a file and specify the encoding, consider something like:

// The explicit charset is used instead of the platform default.
try (OutputStreamWriter writer =
         new OutputStreamWriter(new FileOutputStream(new File("path/to/file")), StandardCharsets.UTF_8)) {
    writer.write(text, 0, text.length());
}

See the documentation here.
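To check whether an explicit charset makes the difference (for instance when the same program behaves differently from Eclipse and as a runnable JAR), the following sketch prints the JVM default charset and round-trips a few Latvian words through a UTF-8 file. The class and method names are hypothetical, and it uses java.nio.file.Files as one possible alternative to an OutputStreamWriter.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8FileRoundTrip {

    // Writes text as UTF-8 regardless of the JVM's default charset.
    static void writeUtf8(Path path, String text) throws IOException {
        Files.write(path, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the file back, decoding explicitly as UTF-8.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // The default can differ between an IDE launch and "java -jar".
        System.out.println("Default charset: " + Charset.defaultCharset());

        // "Maksātājs, Ēģipte" written with Unicode escapes.
        String text = "Maks\u0101t\u0101js, \u0112\u0123ipte";
        Path path = Files.createTempFile("latvian", ".txt");
        writeUtf8(path, text);
        System.out.println("Round trip intact: " + text.equals(readUtf8(path)));
        Files.delete(path);
    }
}
```

If the printed default charset differs between the two ways of launching the program, any writer created without an explicit charset will produce different files in each case.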

edited Jun 02 '18 at 20:36

answered Jun 02 '18 at 15:34


Glad it helped you. If you feel it was adequate, please consider marking the answer as accepted. – geco17 Jun 02 '18 at 16:02
I was trying to write the output to a file. It works when I run the program from Eclipse, but when I export it as a runnable JAR, the file is written like in image 1. Can anyone help me write this output to a file and export it to a JAR? – Praveen Kenny Jun 02 '18 at 18:17
What have you tried? How are you writing to the file? You need to use a writer configured for utf-8. See https://stackoverflow.com/questions/1001540/how-to-write-a-utf-8-file-with-java – geco17 Jun 02 '18 at 18:58
