
I have this code in Java to take a PDF file and extract all the text:

File file = new File("C:/file.pdf");
PDDocument doc = PDDocument.load(file);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(doc);
System.out.println(content);

If we run the application on Windows, it works correctly and extracts all the text. However, when we deploy the app to the server, which runs Linux, the Spanish accents are converted into "strange" characters, e.g. "carÃ¡cter" instead of "carácter". I tried to convert the String to bytes and then back to a UTF-8 string:

byte[] b = content.getBytes(Charset.forName("UTF-8"));
String text= new String(b);
System.out.println(text);

But it does not work: on Windows it keeps working well, but on the Linux server the Spanish accents are still shown incorrectly. I would expect that if it works correctly in a Windows environment, it should also work in a Linux environment. Any idea what it can be, or what I can do? Thank you.

  • https://stackoverflow.com/questions/655891/converting-utf-8-to-iso-8859-1-in-java-how-to-keep-it-as-single-byte You might have to maintain two versions or some other mechanism to switch between charsets. I don't have a Linux server spun up to test. – danny117 Mar 12 '19 at 17:41
  • I think that whatever happens happens after the extraction. Please write the text into an OutputStreamWriter while taking care to use UTF-8 encoding instead of using `System.out.println` (see the sketch after these comments). Please consider sharing your file and mention what PDFBox version you are using. – Tilman Hausherr Mar 13 '19 at 08:46
  • Also mention what version you are using. In the 1.8 versions, the encoding had to be set in the constructor, but this is no longer needed in 2.0. – Tilman Hausherr Mar 15 '19 at 11:02
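
A minimal sketch of the suggestion in the comment above, assuming PDFBox 2.x (the class and output file names here are only illustrative): write the extracted text through an OutputStreamWriter pinned to UTF-8 instead of relying on System.out.println and the platform default charset.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class ExtractToUtf8 {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("C:/file.pdf"));
                 // pin the output encoding to UTF-8, independent of the platform default
                 Writer out = new OutputStreamWriter(
                         new FileOutputStream("extracted.txt"), StandardCharsets.UTF_8)) {
                String content = new PDFTextStripper().getText(doc);
                out.write(content);
            }
        }
    }

If the file written this way contains the correct accents, the extraction itself is fine and only the console output is misconfigured.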

1 Answer


Ã¡ is what you get when the UTF-8 encoded form of á is misinterpreted as Latin-1.

There are two possibilities for this to happen:

  1. a bug in PDFTextStripper.getText() - Java strings are UTF-16 encoded, but getText() may be returning a string containing UTF-8 byte octets that have been expanded as-is to 16-bit Java chars, thus producing 2 chars 0x00C3 0x00A1 instead of 1 char 0x00E1 for á. Subsequently calling content.getBytes(UTF8) on such a malformed string would just give you more corrupted data.

    To "fix" this kind of mistake, loop through the string copying its chars as-is to a byte[] array, and then decode that array as UTF-8:

    byte[] b = new byte[content.length()];
    for (int i = 0; i < content.length(); ++i) {
        // each char is assumed to carry one UTF-8 byte octet widened to 16 bits
        b[i] = (byte) content.charAt(i);
    }
    String text = new String(b, StandardCharsets.UTF_8);
    System.out.println(text);
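
    Equivalently, under the same assumption that every char in the malformed string holds a single byte value (0x00-0xFF), the standard library can do the same repair in one line (chars outside that range would be turned into '?'):

    String text = new String(content.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);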
    
  2. a configuration mismatch - PDFTextStripper.getText() may be returning a properly encoded UTF-16 string containing an á char as expected, but then System.out.println() outputs the UTF-8 encoded form of that string, and your terminal/console misinterprets the output as Latin-1 instead of as UTF-8.

    In this case, the code you have shown is fine; you would just need to double-check your Java environment and terminal/console configuration to make sure they agree on the charset used for console output (see the sketch below).
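
    A quick way to check this case from inside the program, as a sketch (Charset.defaultCharset(), the file.encoding property and this PrintStream constructor are standard Java; whether they match your terminal is what you need to verify):

    import java.io.PrintStream;
    import java.nio.charset.Charset;

    // what charset does the JVM assume for console output?
    System.out.println("default charset: " + Charset.defaultCharset());
    System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

    // force UTF-8 output regardless of the platform default;
    // the terminal itself must then also be set to UTF-8 (e.g. a UTF-8 locale on Linux)
    PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
    utf8Out.println(content);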

You need to check the actual char values in content to know which case is actually happening.
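
For example, a small diagnostic loop (just a sketch) that prints each non-ASCII char with its UTF-16 code unit: seeing 0x00C3 followed by 0x00A1 where an á should be points to case 1, while a single 0x00E1 points to case 2.

    for (int i = 0; i < content.length(); ++i) {
        char c = content.charAt(i);
        if (c > 0x7F) {
            // dump each non-ASCII char together with its code unit in hex
            System.out.printf("index %d: %c = U+%04X%n", i, c, (int) c);
        }
    }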

Remy Lebeau
  • Variant 1 is extremely implausible because utf-8 is an encoding not in use in pdfs at all (ISO 32000-1) or only in contexts not related to text extraction (ISO 32000-2). The only situation I could imagine would be an invalid pdf using utf-8 in its **ToUnicode** tables. But then the same issue would occur on windows. – mkl Mar 13 '19 at 07:11