8

I would like to distribute my 20-jar application as pack200 files, but I also need to provide file checksums for the sake of validation.

Because I am paranoid (thank you, JWS), I would also like to have checksums on the decompressed files.

Is pack200 decompression deterministic, giving identical results on all platforms (Windows/Mac/Linux, 32- and 64-bit)?

In other words, can I decompress the files on one computer, compute their checksums, and expect them to always be identical when decompressed on other computers?

EDIT: Thanks for the comments. I am looking for some hard specification to confirm or deny this.

Making assumptions (even based on testing on a few machines) means risk.

Implementations may vary across platforms and Java versions. Even the same implementation can give different results (thinking of the order of entries in the ZIP directory?). That's why I ask whether it's the same across all platforms and Java versions AND deterministic.


If this cannot be confirmed or denied, how about this follow-up question: how can I verify that a jar is valid after decompression? Thinking of half-finished files, gamma rays corrupting single bits in the file, and whatnot.

Konrad Garus
  • The JARs have a checksum in them for every entry. If they didn't unpack the same across all platforms they would be corrupt and would be detected as such. – Peter Lawrey May 27 '11 at 11:42
  • @Peter thanks. Does it imply that the whole jars are always identical? – Konrad Garus May 27 '11 at 11:45
  • Have you tried unpacking the same pack200 archive on different machines and different Java versions? – Denis Tulskiy May 27 '11 at 11:48
  • pack200 is itself a Java application, which should run the same across all platforms. It is hard to imagine why it would have platform-dependent behaviour. I would assume it is the same unless you know otherwise. – Peter Lawrey May 27 '11 at 11:48
  • @tulskiy I'm looking for some hard reference here. I can test it on 5 systems, but it's not necessarily universally reliable. My users might be on a 6th untested system that gives different results, or different versions of Java might work differently... So I need some hard specs that say "yes, it's always identical". – Konrad Garus May 27 '11 at 11:50
  • @Konrad Garus: Since the decompressed files *depend* on being byte-for-byte identical with the original (what with bytecode and whatnot), I can't imagine that a lossy compression algorithm would be chosen. I haven't taken a look into the source code, so it's not "hard specs". – Piskvor left the building May 27 '11 at 11:56
  • @Peter, @Piskvor What about the ZIP (JAR) file itself? – Konrad Garus May 27 '11 at 12:02
  • I can tell it's the same, but if you need hard evidence you can read the JSR 200 docs; edit: the JAR/ZIP format is also documented, read the specification. – bestsss May 27 '11 at 12:13
  • Can't you just checksum the class files, not the jar itself? I think it's built into the jar or something? – Denis Tulskiy May 27 '11 at 12:13

2 Answers

7

I think this is what you're looking for:

...However, for any given Pack200 archive, every decompressor is required to produce a particular byte-wise image for each class file transmitted. This requirement is placed on decompressors in order to make it possible for compressors to transmit information, such as message digests, which relates to the eventual byte-wise contents of transmitted class files. This section describes the restrictions placed on every decompressor that makes the byte-wise contents of its output files a well-defined function of its input.

This means you can do what you want to do here. JSR 200/Pack200 works by factoring out constants that are shared across classes and intelligently compressing the .class files - but this portion of the standard says that while it would be possible to reconstruct the class files in several different ways, doing so would make it impossible to verify those files with digests. To avoid that issue, Pack200 explicitly specifies how decoding must work - so while the output .class files may not be identical to the input .class files, every Pack200 decompressor's output .class files will match every other Pack200 decompressor's output .class files.

So your best bet is to pack them with Pack200, unpack them, then run MD5 or a comparable digest algorithm over the result, and use that to verify the unpacked files.
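A minimal sketch of that workflow, using the java.util.jar.Pack200 API and a SHA-256 digest (the file names are made up for illustration, and this is only one way to wire it up):

    import java.io.*;
    import java.security.MessageDigest;
    import java.util.jar.JarOutputStream;
    import java.util.jar.Pack200;
    import java.util.zip.GZIPInputStream;

    public class UnpackAndDigest {
        public static void main(String[] args) throws Exception {
            File packed = new File("app.jar.pack.gz");   // hypothetical packed file
            File unpacked = new File("app.jar");         // jar rebuilt by the unpacker

            // Rebuild the jar from the gzipped pack200 stream.
            try (InputStream in = new GZIPInputStream(new FileInputStream(packed));
                 JarOutputStream out = new JarOutputStream(new FileOutputStream(unpacked))) {
                Pack200.newUnpacker().unpack(in, out);
            }

            // Digest the rebuilt jar; this is the value you would publish and compare against.
            System.out.println(hexDigest(unpacked));
        }

        static String hexDigest(File file) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (InputStream in = new FileInputStream(file)) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    md.update(buf, 0, n);
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }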

Hope that answers your question!

Travis
  • Great, thanks. What about the JAR (ZIP) file itself? Is that guaranteed to be identical, or only the classes it contains? Because the spec seems to cover only classes, hashing the whole file may be unreliable. – Konrad Garus May 27 '11 at 13:00
  • You mean the JAR that comes out of the Pack200 decoding? Yes, it will be identical for all platforms - because of the restrictions on the decompressor's output. It may not, however, be identical to the input JAR - so once again, Pack200 the jar, decompress, THEN take a digest and use that. The reason is that Pack200 is not actually compressing the JAR - it's taking it apart, taking the .class files, recompressing them, then building a new JAR when it decompresses - Pack200 is for transport, JAR is for JVM loading. So the jars it builds are subject to the above restrictions as well - this includes file order etc. – Travis May 27 '11 at 13:07
  • The pasted spec seems to apply only to individual class files, not to the JAR as a whole. – Konrad Garus May 27 '11 at 13:14
  • Pretty sure that it's true, even if the spec does not explicitly state it - but you're probably right! If you're worried, you can use the pack200 utility with the --repack option to just compress the JAR files with GZIP instead of converting them to full Pack200 archives. If you do this, you can just checksum the original JARs, call pack200 with the --repack option, and then when you unpack them on the other side, checksum the jar that comes out. See: http://download.oracle.com/javase/1.5.0/docs/tooldocs/share/pack200.html – Travis May 27 '11 at 15:07 (a sketch of this repack-and-checksum idea follows these comments)
  • Thanks for the effort. It helps, but I find Stephen's less optimistic answer more accurate. – Konrad Garus May 30 '11 at 16:01
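A sketch of the repack-and-checksum idea from the comment above, done in-process with the java.util.jar.Pack200 API instead of the pack200 command-line tool (packing and then immediately unpacking is what --repack does); the file names are made up:

    import java.io.*;
    import java.security.MessageDigest;
    import java.util.jar.JarFile;
    import java.util.jar.JarOutputStream;
    import java.util.jar.Pack200;

    public class RepackAndChecksum {
        public static void main(String[] args) throws Exception {
            File original = new File("app.jar");               // hypothetical original jar
            File packed = new File("app.jar.pack");            // intermediate pack200 stream
            File normalized = new File("app-normalized.jar");  // jar rebuilt by the unpacker

            // Pack, then immediately unpack - the in-process equivalent of `pack200 --repack`.
            try (OutputStream out = new FileOutputStream(packed)) {
                Pack200.newPacker().pack(new JarFile(original), out);
            }
            try (JarOutputStream out = new JarOutputStream(new FileOutputStream(normalized))) {
                Pack200.newUnpacker().unpack(packed, out);
            }

            // Per the comment's claim, this digest of the normalized jar is the one to publish;
            // unpacking the distributed pack200 file elsewhere should rebuild the same bytes.
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (InputStream in = new FileInputStream(normalized)) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    md.update(buf, 0, n);
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(hex);
        }
    }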
1

I am looking for some hard specification to confirm or deny this.

@Travis's answer says that the reconstructed class files are not byte-for-byte identical to the original class files, and this (obviously) means that the JAR files won't be identical either.

Furthermore, none of the documentation says that unpack200 will produce identical JAR files across all platforms, and I wouldn't expect it to. (For a start, different platforms will be running different versions of unpack200 ...)

If this cannot be confirmed or denied, how about this follow-up question: how can I verify that a jar is valid after decompression? Thinking of half-finished files, gamma rays corrupting single bits in the file, and whatnot.

I don't think there's a way to do this either. If we assume that regenerated JAR files may be platform-dependent, then we have no baseline to generate a checksum from.

I think your best bet is to send a high-quality checksum of the pack200 file, and trust that unpack200 will either work correctly or set a non-zero exit code when it fails ... like any correctly written utility should do.
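A minimal sketch of that approach, assuming the JDK's unpack200 tool is on the PATH and that the expected checksum is shipped alongside the pack200 file; the file names and the placeholder digest are hypothetical:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;

    public class VerifyThenUnpack {
        public static void main(String[] args) throws Exception {
            String packedFile = "app.jar.pack.gz";
            String expectedSha256 = "<digest shipped with the file>"; // placeholder

            // First verify the checksum of the pack200 file itself.
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (InputStream in = new FileInputStream(packedFile)) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    md.update(buf, 0, n);
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            if (!hex.toString().equalsIgnoreCase(expectedSha256)) {
                throw new IllegalStateException("Checksum mismatch for " + packedFile);
            }

            // Then let unpack200 rebuild the jar and rely on its exit code to signal failure.
            Process p = new ProcessBuilder("unpack200", packedFile, "app.jar")
                    .inheritIO()
                    .start();
            if (p.waitFor() != 0) {
                throw new IllegalStateException("unpack200 failed with exit code " + p.exitValue());
            }
        }
    }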

BTW, if you are that worried about random errors, how are you going to detect "cosmic ray" effects when the JVM loads code from the JAR files? The sensible approach is to use ECC memory, etc and leave this to the hardware to deal with.

Stephen C
  • Thanks, that's what I thought. Looks like I could verify the contents of a signed jar though (http://stackoverflow.com/questions/1374170/how-to-verify-a-jar-signed-with-jarsigner-programmatically/1796775#1796775). That may be good enough - transfer the .pack.gz with a checksum, then unpack and verify using that method. Cosmic rays were an exaggeration, but I'm not extremely confident about files lying on the disk for too long or being half-extracted. – Konrad Garus May 27 '11 at 13:41 (a sketch of that verification approach follows below)
  • Yea ... assuming you don't mean a signed JAR that you've pack200'd – Stephen C May 28 '11 at 16:05
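A minimal sketch of the signed-jar check referenced in the comment above: open the jar with verification enabled and read every entry in full, which forces the per-entry digests to be checked (the file name is made up; checking who actually signed each entry, e.g. via JarEntry.getCodeSigners(), is a separate step):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Enumeration;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;

    public class VerifySignedJar {
        public static void main(String[] args) throws IOException {
            try (JarFile jar = new JarFile("app.jar", true)) {  // true = verify signatures
                byte[] buf = new byte[8192];
                Enumeration<JarEntry> entries = jar.entries();
                while (entries.hasMoreElements()) {
                    JarEntry entry = entries.nextElement();
                    try (InputStream in = jar.getInputStream(entry)) {
                        // A SecurityException here means an entry's digest does not match.
                        while (in.read(buf) != -1) {
                            // just drain the stream
                        }
                    }
                }
            }
            System.out.println("All entry digests verified");
        }
    }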