I am reading PDF documents with the iTextSharp library. The documents are in Czech, which uses diacritics (ř ě ž š č etc.). How can I read these characters? Alternatively, is there a way to replace them with plain r e z s c? This is the code in my method. Thanks.

    PdfReader reader = new PdfReader("M:/ShareDirs_KSP/RDM_Debtors/DMS_PROD/" + src);

    // we can inspect the syntax of the imported page
    String text = new String();
    for (int page = 1; page <= 1; page++) {

        text += PdfTextExtractor.getTextFromPage(reader, page);
    }

    reader.close();
Chris Haas
Edák Edák
  • If the PDF was created correctly, then the chars should be parsed correctly. Which version of iText are you using? Is the font stored in the PDF as a simple font or a composite font? Read http://stackoverflow.com/questions/26631815/cant-get-czech-characters-while-generating-a-pdf if you don't know the difference. – Bruno Lowagie Oct 31 '14 at 09:10
  • I have version 5.5.2. I'm not writing PDFs, only reading them. Where can I set the encoding? – Edák Edák Oct 31 '14 at 09:21

1 Answer

I have written a small proof of concept that parses the file czech.pdf. This file contains several characters with diacritics. It was created in answer to the following question: Can't get Czech characters while generating a PDF

The text is stored in the file twice: once using a simple font, once using a composite font. In my proof of concept (named ParseCzech), I parse this PDF and write the extracted text to a file encoded as UTF-8 (Unicode):

public void parse(String filename) throws IOException {
    PdfReader reader = new PdfReader(filename);
    // DEST is the path of the text file that receives the output
    FileOutputStream fos = new FileOutputStream(DEST);
    for (int page = 1; page <= 1; page++) {
        // encode the extracted text explicitly as UTF-8 before writing it
        fos.write(PdfTextExtractor.getTextFromPage(reader, page).getBytes("UTF-8"));
    }
    fos.flush();
    fos.close();
    reader.close();
}

The result is the file czech.txt:

[Screenshot: czech.txt opened in a text viewer]

As you can see from the screenshot, the text is extracted correctly. Make sure, though, that the viewer you use knows that the file is encoded as UTF-8; otherwise you may see strange characters instead of the actual text.
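To illustrate the point about the viewer's encoding, here is a minimal sketch that uses only the JDK, not iText. The class name `Utf8RoundTrip` and its two helper methods are made up for this example. It writes a string as UTF-8, reads it back with the same charset, and then decodes the same bytes as ISO-8859-1 to show how the "strange characters" come about.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8RoundTrip {

    // The five Czech letters from the question, written as Unicode escapes
    // so that this source file compiles under any source encoding.
    public static final String SAMPLE = "\u0159\u011b\u017e\u0161\u010d";

    // Writes the text to a temporary file as UTF-8 and reads it back as UTF-8.
    public static String roundTrip(String text) throws IOException {
        Path tmp = Files.createTempFile("czech", ".txt");
        try {
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    // Decodes UTF-8 bytes with the wrong charset, which is what a viewer
    // does when it does not know the file is UTF-8.
    public static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(SAMPLE).equals(SAMPLE)); // same charset: text survives
        System.out.println(misread(SAMPLE).equals(SAMPLE));   // wrong charset: text is garbled
    }
}
```

If the round trip preserves the text but your viewer still shows garbage, the problem is most likely in the viewer's encoding setting rather than in the extraction.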

Note that some PDFs do not allow text to be extracted correctly. This is explained in the following video: http://www.youtube.com/watch?v=wxGEEv7ibHE

Please share your PDF so that people on StackOverflow can check whether the extraction fails because of an error in your code or because the PDF itself doesn't allow the text to be extracted.
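On the fallback raised in the question (replacing ř ě ž š č with plain r e z s c): once the text has been extracted correctly, one possible approach is Unicode decomposition with the JDK's `java.text.Normalizer`. This is only a sketch under that assumption; the class name `StripDiacritics` is made up for the example, and it cannot repair text that was already extracted incorrectly.

```java
import java.text.Normalizer;

public class StripDiacritics {

    // Splits each accented letter into base letter + combining mark (NFD),
    // then removes all combining marks (Unicode category M).
    public static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The five Czech letters from the question, as Unicode escapes
        // so that the source compiles under any source encoding.
        String sample = "\u0159\u011b\u017e\u0161\u010d";
        System.out.println(strip(sample)); // prints "rezsc"
    }
}
```

Letters that have no canonical decomposition (for example ł or ø in other languages) are left unchanged by this approach.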

Bruno Lowagie
  • Unfortunately these documents contain highly confidential information (they were sent by courts). I added .getBytes("UTF-8"), but my text variable contains only "[B@1d14147" :/ – Edák Edák Oct 31 '14 at 10:05
  • If you are a customer, you can share the document with the paid support team under an NDA. If you are a user, why don't you take a look at the document using RUPS? If the document contains secret information, the text may have been obfuscated on purpose, in which case you won't be able to extract it. Watch the video if you want to understand what I mean by that. – Bruno Lowagie Oct 31 '14 at 10:33
  • *my text variable contains only "[B@1d14147"* - that looks like your text variable is a byte array and you are printing it as is, or printing its toString value. That obviously cannot work. – mkl Oct 31 '14 at 20:33
  • @mkl That's so obvious that I overlooked it. – Bruno Lowagie Oct 31 '14 at 21:08
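The "[B@1d14147" output discussed in the comments can be reproduced with the JDK alone. The sketch below (class and method names are made up for the example) shows that calling toString on a byte array yields the type tag "[B" plus a hash code, while decoding the bytes with the charset they were encoded in gives the text back.

```java
import java.nio.charset.StandardCharsets;

public class ByteArrayPrinting {

    // What you get when a byte[] is printed or concatenated directly:
    // "[B@" followed by a hexadecimal identity hash, not the content.
    public static String asPrinted(byte[] bytes) {
        return bytes.toString();
    }

    // Decodes the bytes back into text using the charset they were encoded with.
    public static String decoded(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample text with Czech letters, as Unicode escapes so that the
        // source compiles under any source encoding.
        byte[] bytes = "\u010de\u0161tina".getBytes(StandardCharsets.UTF_8);
        System.out.println(asPrinted(bytes)); // starts with "[B@"
        System.out.println(decoded(bytes));   // the original text
    }
}
```

In the question's code, the extracted text is already a String, so getBytes("UTF-8") is only needed at the point where the text is written to a stream, as in the parse method above.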