
Java 7 is supposed to fix an old problem with unpacking zip archives whose file names are encoded in a character set other than UTF-8. The fix is the constructor ZipInputStream(InputStream, Charset). So far, so good: I can unpack a zip archive containing file names with umlauts when I explicitly set the ISO-8859-1 character set.

But here is the problem: when iterating over the stream using ZipInputStream.getNextEntry(), the entries have wrong special characters in their names. In my case the umlaut "ü" is replaced by a "?" character. Does anybody know how to fix this? ZipEntry appears to ignore the Charset of its underlying ZipInputStream. It looks like yet another zip-related JDK bug, but I might be doing something wrong as well.

...
zipStream = new ZipInputStream(
    new BufferedInputStream(new FileInputStream(archiveFile), BUFFER_SIZE),
    Charset.forName("ISO-8859-1")
);
while ((zipEntry = zipStream.getNextEntry()) != null) {
    // wrong name here, something like "M?nchen" instead of "München"
    System.out.println(zipEntry.getName());
    ...
}
kriegaex
  • What are the best practices for Java SE6 (besides upgrading to SE7 :)? – basZero Jan 07 '13 at 09:56
  • For SE6: I tested setting the VM parameters `zip.altEncoding` or `zip.encoding` to `Cp437` or `ISO-8859-1`; neither helped to read the names correctly – basZero Jan 07 '13 at 10:31
  • @basZero: Apache Commons Compress works nicely. I found no out-of-the-box solution though. – kriegaex Jan 07 '13 at 14:02

1 Answer


I played around for two or so hours, but just five minutes after I finally posted the question here, I bumped into the answer: My zip file was not encoded with ISO-8859-1, but with Cp437. So the constructor call should be:

zipStream = new ZipInputStream(
    new BufferedInputStream(new FileInputStream(archiveFile), BUFFER_SIZE),
    Charset.forName("Cp437")
);

Now it works like a charm.
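
To make the effect reproducible without a particular archive, here is a minimal sketch. It is not the original code: the class `ZipNameCharsetDemo` and its helper methods are hypothetical names, and the archive is built in memory with `ZipOutputStream(OutputStream, Charset)` so that the entry name is stored in Cp437. It then reads the same bytes back once with Cp437 and once with ISO-8859-1. It assumes the JDK in use ships the Cp437 charset, which standard desktop JDKs normally do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of the original question or answer.
public class ZipNameCharsetDemo {

    // Builds an in-memory archive with one entry whose name is encoded in the given charset.
    public static byte[] zipWithEntry(String entryName, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes, charset)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(new byte[] {1, 2, 3}); // arbitrary payload
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Lists the entry names, decoding them with the given charset.
    public static List<String> entryNames(byte[] zip, Charset charset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip), charset)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Charset cp437 = Charset.forName("Cp437");
        // "\u00fc" is the umlaut "ü"; the escape avoids source-file encoding issues.
        byte[] zip = zipWithEntry("M\u00fcnchen.txt", cp437);

        // Same charset for writing and reading: the name is decoded as it was written.
        System.out.println(entryNames(zip, cp437));
        // Mismatched charset: the byte that Cp437 uses for "ü" is decoded as a different character.
        System.out.println(entryNames(zip, Charset.forName("ISO-8859-1")));
    }
}
```

Cp437 stores "ü" as the byte 0x81, which ISO-8859-1 maps to a control character rather than a letter. That is consistent with the "?" placeholder seen when the name is printed.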

halfer
kriegaex
  • I think you can accept this answer as correct, even though you wrote it yourself, per this article: http://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/ – seh Jun 30 '12 at 18:30
  • I had the same problem, and it took me hours to solve. The solution was very simple: for me, it was using the MS-DOS encoding cp852 instead of the Windows encoding cp1250 – Perlos Mar 29 '19 at 09:19
  • Yes, that is the very same problem and the same solution, just not for the English MS-DOS code page 437 but for the Central European code page 852. Of course the exact solution always depends on the environment and tool the ZIP archive in question was generated in/with. – kriegaex Mar 30 '19 at 01:36
  • The Java behaviour is arguably non-conformant, as the spec seems quite clear that Cp437 is the default when the "Language encoding flag (EFS)" has not been set. "D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437.... D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding" https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT – mrg Mar 09 '21 at 17:59
  • I upvoted your comment because the link is a very helpful resource. To be fair, Java does not claim to try and detect the encoding or even read the EFS but clearly documents that it uses UTF-8 as a default, which is understandable nowadays, especially because it is also the JAR file default. So in Java you have to **know** the encoding before calling the `ZipInputStream` constructor. Fair enough. What makes your comment so helpful is knowing that Cp437 is actually a default, so this should be one of the first encodings to try when there are any problems. – kriegaex Mar 15 '21 at 01:36
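
The "general purpose bit 11" mentioned in the comments can be inspected by hand. The following is only a sketch under stated assumptions: the class `ZipLanguageFlag` and its method are hypothetical names, and it assumes the archive starts directly with a local file header, which is not true for empty or self-extracting archives. It reads the two-byte little-endian flag field at offset 6 of that header and tests bit 11.

```java
// Hypothetical helper, not a JDK API: checks the language encoding flag (EFS)
// of the first local file header in a zip archive.
public class ZipLanguageFlag {

    private static final int EFS_BIT = 1 << 11; // general purpose bit 11

    // Returns true if the first entry declares UTF-8 names, false if the bit is unset.
    public static boolean declaresUtf8Names(byte[] zip) {
        // A local file header starts with the signature bytes 'P' 'K' 0x03 0x04.
        if (zip.length < 8 || zip[0] != 0x50 || zip[1] != 0x4b || zip[2] != 0x03 || zip[3] != 0x04) {
            throw new IllegalArgumentException("no local file header at offset 0");
        }
        // The general purpose bit flag is stored little-endian at offsets 6 and 7.
        int flags = (zip[6] & 0xff) | ((zip[7] & 0xff) << 8);
        return (flags & EFS_BIT) != 0;
    }

    public static void main(String[] args) {
        // Hand-made header prefix: signature, version needed (20), flags with bit 11 set.
        byte[] header = {0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x00, 0x08};
        System.out.println(declaresUtf8Names(header));
    }
}
```

If the method returns false, the quoted spec text suggests trying Cp437 first. As the other comments show, the code page actually used (for example cp852) still depends on the tool that created the archive.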