import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.text.PDFTextStripper;
public class sample {
public static void main(String[] args) throws InvalidPasswordException, IOException {
File file = new File("C:\\sample.pdf");
PDDocument document = PDDocument.load(file);
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
//java.io.PrintStream p = new java.io.PrintStream(System.out,false,"Cp921");
//p.println(text.toString());
System.out.println(text);
}
}
The text is read from the pdf but while displaying using System.out.println
it shows a different output.
Then I read different posts online and found that it had something to do with encoding and I found a solution at this question: Text extracted by PDFBox does not contain international (non-English) characters but I had to use encoding of Cp921 for Latvian characters but still I have the problem not solved and the output is given in this image
Then I went through the process of debugging and found that the text read from PDF is stored in exact encoding without any changes so I don't know how to display the text with correct encoding. Any help would be great thanks in advance.
Sample PDF content: [Maksātājs, Informācija, Vārdu krājums, Ēģipte, Plašs, Vājš, Brieži, Pērtiķi, Grāmatiņa, šķīvis]
Console output in Eclipse using System.out.println
:
Console output in eclipse using PrintStream
:
P.S. I am beginner programmer and I have not much experience in coding