30

I'm downloading zipped files containing XMLs, and I'd like to avoid writing the zip files to disk before manipulating them because of latency requirements. However, java.util.zip doesn't suffice for me. There's no way to say "here's a byte array of a zip file, use it" without turning it into a stream, and ZipInputStream is not reliable, since it scans for entry headers (see discussion below EDIT for reasons why that is not reliable).

I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through the ZipInputStream, and I need to find a solution that will work with any valid ZIP files, as the penalty for a failure once I go into production will be high.

Assuming ZipInputStream won't work, what can I do to solve this problem in cases where there are no entry headers? I'm using Wikipedia's definition, which includes a comment on how to correctly uncompress zip files (quoted below), as the standard.

EDIT

The Apache Commons Zip library has a good write-up on some of the problems that using a stream (both their solution and Java's) has. I'll further add, from Wikipedia and personal experience, that the size and CRC fields on entry headers may not be filled in (I have files with -1 in these fields). Thanks to centic for providing this link.

Also, let me quote the wikipedia on the subject:

Tools that correctly read zip archives must scan for the signatures of the various fields, the zip central directory. They must not scan for entries because only the directory specifies where a file chunk starts. Scanning could lead to false positives, as the format doesn't forbid other data to be between chunks, or uncompressed stream containing such signatures.

Note that ZipInputStream scans for entries, not the central directory, which is the problem with it.
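To make the distinction concrete, here is a minimal sketch of what "reading the central directory" means for a zip file held in a byte array: find the end-of-central-directory record from the back, then walk the central directory to learn where each local entry starts. The class and method names (`CentralDirectoryLister`, `listEntries`) are made up for illustration, and the sketch assumes no ZIP64 records, a single-disk archive, UTF-8 entry names, and offsets that are relative to the start of the array (so no data prepended, as a self-extractor stub would be).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class CentralDirectoryLister {
    private static final int EOCD_SIGNATURE = 0x06054b50; // bytes "PK\5\6", little-endian
    private static final int CEN_SIGNATURE = 0x02014b50;  // bytes "PK\1\2", little-endian
    private static final int EOCD_MIN_SIZE = 22;          // EOCD record without a comment
    private static final int MAX_COMMENT = 0xFFFF;        // comment length is a 16-bit field

    /** Scans backwards for the end-of-central-directory record; returns its offset or -1. */
    static int findEndOfCentralDirectory(ByteBuffer buf) {
        int last = buf.limit() - EOCD_MIN_SIZE;
        int first = Math.max(0, last - MAX_COMMENT);
        for (int pos = last; pos >= first; pos--) {
            if (buf.getInt(pos) == EOCD_SIGNATURE) {
                int commentLength = buf.getShort(pos + 20) & 0xFFFF;
                // Reject signatures whose declared comment would run past the end.
                if (pos + EOCD_MIN_SIZE + commentLength <= buf.limit()) {
                    return pos;
                }
            }
        }
        return -1;
    }

    /** Maps entry name to local header offset, exactly as the central directory records it. */
    static Map<String, Long> listEntries(byte[] zip) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        int eocd = findEndOfCentralDirectory(buf);
        if (eocd < 0) {
            throw new IllegalArgumentException("no end-of-central-directory record found");
        }
        int entryCount = buf.getShort(eocd + 10) & 0xFFFF; // total number of entries
        int pos = buf.getInt(eocd + 16);                   // start of the central directory
        Map<String, Long> offsets = new LinkedHashMap<>();
        for (int i = 0; i < entryCount; i++) {
            if (buf.getInt(pos) != CEN_SIGNATURE) {
                throw new IllegalArgumentException("bad central directory header at " + pos);
            }
            int nameLength = buf.getShort(pos + 28) & 0xFFFF;
            int extraLength = buf.getShort(pos + 30) & 0xFFFF;
            int commentLength = buf.getShort(pos + 32) & 0xFFFF;
            long localHeaderOffset = buf.getInt(pos + 42) & 0xFFFFFFFFL;
            // Assumption: names are UTF-8 (strictly, that depends on general purpose flag bit 11).
            String name = new String(zip, pos + 46, nameLength, StandardCharsets.UTF_8);
            offsets.put(name, localHeaderOffset);
            pos += 46 + nameLength + extraLength + commentLength;
        }
        return offsets;
    }
}
```

Nothing here looks for local entry signatures in the body of the archive, so stray data between entries, or an uncompressed payload that happens to contain `PK\3\4`, cannot confuse it.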

Final Edit

If anyone is interested, this script can be used to produce a valid ZIP file that cannot be read by ZipInputStream from an existing ZIP file. So, as a final edit to this closed question, I needed a library that can read files such as the ones produced by this script.

Community
Daniel C. Sobral
  • 9
    In practice I haven't encountered a zipped archive that `ZipInputStream` could not read. Perhaps it happens, but I'd suggest that it might be a rare occurrence. The only real issue I've noticed with it is that improperly synchronized access to a single `ZipInputStream` instance can trigger a concurrency exception in native code, which promptly brings the entire JVM crashing to a halt. Note that Java uses these same classes for loading classes out of JAR files, so one would expect them to be fairly robust, when used properly. – aroth Aug 19 '12 at 23:47
  • Unclear: Do you have a memory image of an entire zip file, or just a zip file member (ie, single compressed file)? In any event, you should be able to create a ZipInputStream from a ByteArrayInputStream. – Hot Licks Aug 19 '12 at 23:49
  • @HotLicks I'm downloading a zip file, so I have it all in memory. Using a `ZipInputStream` has problems, as I reported. – Daniel C. Sobral Aug 20 '12 at 14:34
  • Is what you're downloading potentially that ill-formed that ZIS won't work? – Hot Licks Aug 20 '12 at 16:14
  • You could presumably use sun.misc.URLClassPath getResource to return a Resource object that represents the zipped file already unpacked. – Hot Licks Aug 20 '12 at 16:15
  • 1
    Are you worried that you could get malicious zip data? – nalply Aug 21 '12 at 18:34
  • Maybe you could answer some of our questions. – Hot Licks Aug 21 '12 at 21:20
  • You could presumably use sun.misc.URLClassPath getResource to return a Resource object that represents the zipped file already unpacked. – Hot Licks Aug 22 '12 at 01:23
  • 1
    Your other option is to write your own unzip code. The spec is on the internet. I've written such code. It's actually a fun project, since, if you're clever, it can be done quite compactly. – Hot Licks Aug 22 '12 at 01:26
  • I would try using the 7-zip decompressor: http://www.7-zip.org/sdk.html – djangofan Sep 05 '12 at 23:48
  • 4
    Where did you look at the format which suggests that the entry data is optional? Note that the ability for some tools to work with a file *doesn't* guarantee that it's valid. – Jon Skeet Sep 06 '12 at 05:56
  • 2
    @DanielC.Sobral: I'll edit my answer to address that. It sounds like you're effectively making impossible demands here. – Jon Skeet Sep 08 '12 at 06:37
  • 1
    Please provide download links for some of the ZIP archives ZIS cannot handle, so we have test cases for alternative solutions. – kriegaex Sep 08 '12 at 11:29
  • @JonSkeet Impossible demands? Any library that reads the central directory then goes back to the proper offset will suffice. There's nothing impossible about it. Now, either there's a library in Java that does that with a byte array, or there isn't. If there's not, it's certainly not out of an intrinsic difficulty of the problem. I'm feeling a certain defensiveness towards the Java library, trying to reframe my question as to avoid it. That is beyond silly. – Daniel C. Sobral Sep 09 '12 at 00:27
  • @DanielC.Sobral: So you're limiting yourself to zip files which are *only* invalid in certain well-defined ways? (Not having entry headers, which still *don't* seem optional to me - them being *contiguous* seems to be optional.) My point is that as soon as you say "I might further add that whether the zip is valid or not is not my concern. Working with it is." you're basically inviting failure. Being more restrictive is fine. Anyway, see my edited answer for more details around that and a suggested solution. – Jon Skeet Sep 09 '12 at 07:14
  • 4
    This deserves reopening, definitely. It's a completely valid question. – Richard J. Ross III Sep 12 '12 at 17:00
  • 1
    Judging by the level of speculation, extended discussion and the OP admitting that "I do not yet have access to the zip files I'll be handling" this was closed for all the right reasons. Also the last paragraph of Jon Skeet's answer says it all *"Basically, you need to pin the problem down more tightly before it's feasible to even say whether a particular library is a valid solution"*. – Kev Sep 12 '12 at 22:47
  • @Kev I fail to see the speculation. Libraries that decompress zip files using entry headers are not valid (see wikipedia, now quoted into the answer). I want a library that can correctly decompress a zip file without going through the filesystem. And, no, I don't have the file I need to decompress to test with: it will only be available on one day, the election day, at which point either my application works, or I'll have to explain myself to the people depending on it. Closing the question because some refuse to accept the brokenness of ZIS is the only wrong thing here. – Daniel C. Sobral Sep 13 '12 at 01:43
  • If you think we are wrong then please bring this up on [meta]. – Kev Sep 13 '12 at 01:57
  • @DanielC.Sobral: Libraries that decompress zip files *assuming that entries are contiguous* are not valid. That's not the same as "using entry headers". Not *scanning* for entries isn't the same as not *using* entries. That's a point I've been trying to make repeatedly, and you've repeatedly ignored it. When you then said you didn't care whether a zip file was valid or not, you just had to be able to read it, *that's* when the question became impossible to answer. Now I get the impression you've backed away from that position somewhat, but you should clarify it IMO. – Jon Skeet Sep 13 '12 at 02:45
  • 2
    @DanielC.Sobral: "*I don't have the file I need to decompress to test with ... at which point either my application works, or I'll have to explain myself to the people depending on it.*" That's the worst development strategy I've ever heard of. People aren't asking for the *exact* file you'll have in production use; they're asking for a *test file*. And if you can't get a reasonable test file, then your code will be *untested*. And if your untested code will truly be used on election night... well, I sincerely hope that any voting system relying on your code isn't in a critical district/state. – Nicol Bolas Sep 13 '12 at 02:46
  • 1
    @NicolBolas I do have test files, though I haven't generated one that can't be read with ZIS yet. That's not the point: the point is that I won't have a sample of the *actual* files until that day. As a personal experience with the previous election, I know for a fact that the test samples provided may differ in critical aspects from the production files -- it happened before, with files in a different format, so it may happen again. I have to rely not only on my tests, but on the reliability of the libraries I use. – Daniel C. Sobral Sep 13 '12 at 02:55
  • 2
    So you're still asking for something that will work with file formats you cannot test against because you won't have the "actual" files until election day? And no responsibility is placed on the people providing you the test and actual files? And you feel your job is threatened by this? And you're not looking for a new job? – Dave Newton Sep 13 '12 at 04:30
  • @DaveNewton Yes, no, yes, no. If I write the file to disk and use `ZipFile`, it will work. If I write it to disk and shell out to an `unzip` process, it will work. Because these two work correctly: they read the end central directory record, then the central directory, and then the entries. A lot of people will be doing that, which is why there's value in doing more. It is often the case that high value comes with high risk. – Daniel C. Sobral Sep 13 '12 at 04:40
  • @HotLicks In the end, I wrote my own class as you suggested. I delegated inflating files to `java.util.zip.Inflater`, and handled everything else. Once the time pressure is off, I'll probably put it on github, after polishing it. – Daniel C. Sobral Sep 28 '12 at 00:09
  • @kriegaex Though the question is closed, I've put a script that produces samples of ZIP files that cannot be read by `ZipInputStream`, due to the way it works. – Daniel C. Sobral Sep 28 '12 at 16:36
  • You mean you actually wrote a Scala program for manipulating ZIP files so as to prove that `ZipInputStream` cannot decode them? Impressive, but why? Maybe because you could not find any such files in the wild? I really think you are trying to solve a non-problem. Which widely known ZIP packer produces such files and with which settings? Sorry, this problem of yours is a bit too esoteric for me. – kriegaex Sep 28 '12 at 18:43
  • @kriegaex Since I cannot attach a zip file, the only option is to attach a program that creates the zip file with the proper characteristics. Please note that any zip program can handle the files correctly, but not `ZipInputStream`. Don't you think you are trying too hard to deny the existence of the problem? You asked for the file, and now I provided it. – Daniel C. Sobral Sep 28 '12 at 21:45
  • False statement. Quoting myself: "Please provide download links for some of the ZIP archives ZIS cannot handle." And: "Which widely known ZIP packer produces such files and with which settings?" With those types of information you would have helped us help you. Your proprietary script which no one else but you uses does not prove anything. As I said, it is just esoteric. – kriegaex Sep 29 '12 at 08:28
  • @kriegaex I don't need to prove anything, since the ZIP specification already does that. Apache Commons even goes to the length of explaining what kinds of problems ZIS has. The script itself illustrates perfectly what kind of thing can prevent ZIS from working. Now, if you prefer to use broken stuff, then, please, just leave the programming field before you inflict pain on all of us. – Daniel C. Sobral Oct 02 '12 at 23:23
  • @kriegaex As for well known ZIP packer that produces such files, try [WinZip Self Extractor](http://www.winzip.com/prodpagese.htm). – Daniel C. Sobral Oct 02 '12 at 23:44
  • 2
    Apparently Apache commons-compress has changed with version 1.5. I am now able to read files that I couldn't read before. Starting with version 1.5, ZipArchiveInputStream will try to read the archive up to and including the "end of central directory" record. – Jasper Krijgsman May 28 '13 at 09:17
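
For readers wondering what the "write your own class and delegate to `Inflater`" route from the comments above involves, the per-entry step might look roughly like the sketch below. It is not the asker's actual code. `RawEntryInflater` and `extract` are hypothetical names, and the compression method and both sizes are assumed to come from the central directory rather than the local header, since the local header's size fields may be unset. Only the stored (0) and deflated (8) methods are handled, the CRC is not checked, and encryption and ZIP64 are ignored.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

final class RawEntryInflater {
    private static final int LOCAL_HEADER_SIGNATURE = 0x04034b50; // bytes "PK\3\4", little-endian
    private static final int STORED = 0;
    private static final int DEFLATED = 8;

    /**
     * Extracts one entry from an in-memory zip. All parameters except the array
     * itself are assumed to have been read from the entry's central directory record.
     */
    static byte[] extract(byte[] zip, int localHeaderOffset, int method,
                          int compressedSize, int uncompressedSize) throws DataFormatException {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(localHeaderOffset) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("no local file header at offset " + localHeaderOffset);
        }
        // The local header is only used to skip over its variable-length name and extra field.
        int nameLength = buf.getShort(localHeaderOffset + 26) & 0xFFFF;
        int extraLength = buf.getShort(localHeaderOffset + 28) & 0xFFFF;
        int dataStart = localHeaderOffset + 30 + nameLength + extraLength;

        byte[] out = new byte[uncompressedSize];
        if (method == STORED) {
            System.arraycopy(zip, dataStart, out, 0, uncompressedSize);
            return out;
        }
        if (method != DEFLATED) {
            throw new IllegalArgumentException("unsupported compression method " + method);
        }
        Inflater inflater = new Inflater(true); // true = raw deflate data, no zlib wrapper
        try {
            inflater.setInput(zip, dataStart, compressedSize);
            int written = 0;
            while (written < out.length && !inflater.finished()) {
                int n = inflater.inflate(out, written, out.length - written);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or corrupt deflate data");
                }
                written += n;
            }
            if (written != out.length) {
                throw new DataFormatException("expected " + out.length + " bytes, got " + written);
            }
            return out;
        } finally {
            inflater.end(); // release native zlib resources
        }
    }
}
```

Because the data offset and sizes come from the directory, this never needs to guess where an entry ends.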

4 Answers

24

EDIT: Another suggestion...

Looking at ZipFile from the Apache Commons implementation, it looks like it wouldn't be too hard to effectively fork that for your project. Create a wrapper around your byte array which has all the pieces of the RandomAccessFile API which are required (I don't think there are very many). You've already indicated that you prefer the interface to ZipFile, so why not go with that?

We don't know enough about your project to know whether this opens up any legal questions - and even if you gave details, I doubt that anyone here would be able to give good legal advice - but I suspect it wouldn't take more than an hour or two to get this solution up and working, and I suspect you'd have reasonable confidence in it.
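
As a rough idea of how small such a wrapper could be, here is a sketch of a byte-array-backed class exposing a few `RandomAccessFile`-style methods. Which methods a forked `ZipFile` would actually call is an assumption on my part (check the source you fork), and `ByteArrayRandomAccess` is a made-up name. Unlike `RandomAccessFile`, this version refuses to seek past the end of the data.

```java
import java.io.EOFException;

final class ByteArrayRandomAccess {
    private final byte[] data;
    private int position;

    ByteArrayRandomAccess(byte[] data) {
        this.data = data;
    }

    long length() {
        return data.length;
    }

    long getFilePointer() {
        return position;
    }

    /** Moves the read position; seeking beyond the end is rejected in this sketch. */
    void seek(long pos) {
        if (pos < 0 || pos > data.length) {
            throw new IllegalArgumentException("seek out of range: " + pos);
        }
        position = (int) pos;
    }

    /** Reads one byte as an unsigned value, or returns -1 at end of data. */
    int read() {
        return position < data.length ? (data[position++] & 0xFF) : -1;
    }

    /** Reads up to len bytes; returns the number read, or -1 at end of data. */
    int read(byte[] dst, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (position >= data.length) {
            return -1;
        }
        int n = Math.min(len, data.length - position);
        System.arraycopy(data, position, dst, off, n);
        position += n;
        return n;
    }

    /** Fills dst completely or throws EOFException without moving the position. */
    void readFully(byte[] dst) throws EOFException {
        if (dst.length > data.length - position) {
            throw new EOFException("requested " + dst.length + " bytes, only "
                    + (data.length - position) + " available");
        }
        System.arraycopy(data, position, dst, 0, dst.length);
        position += dst.length;
    }
}
```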


EDIT: This may be a slightly more productive answer...

If you're worried about the entries not being contiguous, but don't want to handle all the compression side yourself, you might consider an option where you effectively rewrite the data. Create a new ByteArrayOutputStream, and read the central directory at the end. For each entry in the central directory, write out an entry (header + data) to the output stream in a format that you believe ZipInputStream will be happy with. Then write a new central directory - if you want your replacement to be valid you may need to do this from scratch, but if you're using code which you know won't actually read the central directory, you could just provide the original one, ignoring the fact that it might not then be valid. So long as it starts with the right signature, that's probably good enough :)

Once you've done that, convert the ByteArrayOutputStream into a new byte[], wrap it in a ByteArrayInputStream and then pass that to ZipInputStream or ZipArchiveInputStream.

Depending on your purposes, you may not even need to do that much - you may be able to just extract each file as you go by creating a "mini" zip file with just the one entry you're reading from the directory at a time.

This does involve understanding the zip file format, but not completely - just the skeleton, effectively. It's not a quick and easy fix like using an existing API completely, but it shouldn't take very long. It doesn't guarantee it'll be able to read all invalid files (how could it?) but it will protect you against the "data between entries" issue you seem to be particularly concerned about. Hope it's at least a useful idea...


there's no way to say "here's a byte array of a zip file, use it"

Yes there is:

byte[] data = ...;
ByteArrayInputStream byteStream = new ByteArrayInputStream(data);
ZipInputStream zipStream = new ZipInputStream(byteStream);

That leaves the issue of whether ZipInputStream can handle all the zip files you'll give it - but I wouldn't write it off quite so quickly.
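
For completeness, a fuller sketch of the same idea that reads every entry of an in-memory zip into a map. `InMemoryUnzip` and `readAll` are illustrative names, and the code uses Java 7 syntax (try-with-resources). It inherits whatever limitations `ZipInputStream` has, since it still walks local entry headers front to back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class InMemoryUnzip {
    /** Returns entry name to uncompressed bytes, in archive order, skipping directories. */
    static Map<String, byte[]> readAll(byte[] zipBytes) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] chunk = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not the whole archive.
                while ((n = zip.read(chunk)) != -1) {
                    out.write(chunk, 0, n);
                }
                contents.put(entry.getName(), out.toByteArray());
            }
        }
        return contents;
    }
}
```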

Of course, there are other APIs available. You may want to look at Apache Commons Compress, for example. Even though ZipFile requires a file, ZipArchiveInputStream doesn't - so again, you could use a ByteArrayInputStream. EDIT: It looks like ZipArchiveInputStream doesn't read from the central directory either. I was hoping it would use markSupported to check beforehand, but it appears not to...

EDIT: In the comments on the question, I asked where you'd read that the zip file doesn't have to contain entry data. You quoted Wikipedia:

"Tools that correctly read zip archives must scan for the signatures of the various fields, the zip central directory. They must not scan for entries because only the directory specifies where a file chunk starts. Scanning could lead to false positives, as the format doesn't forbid other data to be between chunks, or uncompressed stream containing such signatures."

That's not the same as entry data being optional. It's saying that there may be extra data in awkward places, not that the entries may be missing completely. It's basically saying that the entries shouldn't be assumed to be contiguous. I could happily concede that ZipInputStream may not be reading the central directory at the end of the file, but finding code which does that isn't the same as finding code which copes with entry data not existing.

You then write:

"I might further add that whether the zip is valid or not is not my concern. Working with it is."

... which suggests you want code which will handle invalid zip files. Combined with this:

"I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through the stream"

That means you're asking for code which should handle zip files which are invalid in ways you can't even predict. Just how invalid would it have to be for you to be able to reject it? If I give you 1000 random bytes, with no attempt for them to be a zip file at all, what on earth would you do with it?

Basically, you need to pin the problem down more tightly before it's feasible to even say whether a particular library is a valid solution. It's reasonable to collect a set of zip files from various places, which may be invalid in well-understood ways, and say "I must be able to support all of these." Later you may need to do some work if it turns out that wasn't good enough. But to be able to support anything, however broken, simply isn't a valid requirement.

Jon Skeet
  • I didn't see your edit. I honestly don't know how can I pin the problem down more tightly. I want something that decompresses all valid ZIP files correctly -- which ZIS doesn't. And, yes, if it turns out to be an invalid ZIP file, I'll still have to deal with it, but I'll be in a better position if I don't handicap myself beforehand. – Daniel C. Sobral Sep 13 '12 at 01:50
  • @DanielC.Sobral: Right, so given the one way you've explained in which `ZipInputStream` *wouldn't* handle valid files with extra data between chunks, and the way my edit suggests you handle that, is there any other kind of valid file which you think wouldn't work with the suggestion I've made? – Jon Skeet Sep 13 '12 at 02:29
  • No, the way you suggest should work. However, I'm asking if there are *libraries* that do that for me (it's right there, third paragraph), for two reasons: I don't have much time to do that, and I suspect the chance of letting a fatal bug into production doing it myself might be higher than the chance of getting a file ZIS can't handle. TrueZIP, if it works, is a better answer. Someone suggested Common VFS + ZipFile on Twitter, after the question was closed, which is also a pretty good idea. – Daniel C. Sobral Sep 13 '12 at 02:42
  • @DanielC.Sobral: I've suggested yet another option which might be simpler - see the edit (at the top). One downside of TrueZIP is that it requires Java 7 - are you using Java 7? If it works for you, go for it - I tried reading the documentation and got lost fairly quickly. – Jon Skeet Sep 13 '12 at 02:53
2

The TrueZIP library provides an alternative, mature zip implementation.

It also features file system abstraction even for HTTP.

For example:

Path path = new TPath(new URI("http://acme.com/download/everything.zip/entry.xml"));
try (InputStream in = Files.newInputStream(path)) {
    // Read archive entry contents here.
    ...
}

So, if you are interested only in specific entries, it would download them only, saving bandwidth and time. And you would not have to write downloading code.

See also http://truezip.java.net/faq.html#http.

Vadzim
  • Sadly, while this may be a valid answer (I'll go over that faq later), it doesn't really help me, since all my I/O is asynchronous. So, unless it provides an asynchronous I/O interface to replace mine, I can't use it. I'll still accept the answer if it works, though. – Daniel C. Sobral Sep 08 '12 at 00:11
2

I would use the Apache library commons-compress, see http://commons.apache.org/compress/

It supports reading zip files via streams, and there is in-depth documentation at http://commons.apache.org/compress/zip.html. That page also lists some limitations which are inherent in the zip format.

Sample code looks as follows:

ZipArchiveInputStream zip =
    new ZipArchiveInputStream(inputStream);
try {
    ZipArchiveEntry entry = zip.getNextZipEntry();
    while(entry != null) {
        assertEquals("README", entry.getName());
        ...
        entry = zip.getNextZipEntry();
    }
} finally {
    zip.close();
}
centic
  • Thanks for that link on the Apache Commons, because it expresses correctly the problem of using *a stream* as a Zip. That is not an inherent limitation of the Zip-Format, but of using Streams to handle the Zip-Format, and that's exactly the limitations I need to get around. – Daniel C. Sobral Sep 11 '12 at 14:56
  • I think the problems are actually caused by how the zip format is defined, i.e. having some of the information only stored at the end of the file makes it impossible to accurately handle complicated zips without loading the full file first. Apache Compress uses a compromise in that they provide a streaming interface, but sacrifice some features which are rarely used in zips anyway. So if you know the source of the zips you can be sure that such zips do not occur and be fine with Apache commons. – centic Sep 11 '12 at 17:29
  • Loading the full file first I can do; knowing the source of the zips beforehand I can't -- if I could, I wouldn't be here asking this question, nor would I have offered a bounty on it. – Daniel C. Sobral Sep 13 '12 at 01:54
2

This question sounds similar to "How to create a directory in memory? pseudo file system / virtual directory". Basically, my suggestion is to use a more general solution: an in-memory virtual filesystem (and I don't mean on the OS level, like Linux's ramfs/tmpfs).

One example is to use the Java 7 NIO APIs, which now provide an SPI for implementing a file system via FileSystemProvider. It seems that the ShrinkWrap filesystem implements this SPI.

A more accessible option would be to use Apache Commons VFS' ram filesystem: it requires only Java 5. If you need to be compatible with Java 5 and 6, this is probably your best bet.

I first remember reading about in-memory filesystems in Java from this article, which apart from pointing out solutions like Commons VFS and JBoss Microcontainer, gives a nice example use case for the NetBeans IDE.

While an in-memory virtual filesystem is a nice general solution of avoiding the OS-level filesystem (with the associated performance benefits), it probably suffers from other disadvantages, which more specialized solutions could address. For instance, I am not sure how using this filesystem would behave when used concurrently from multiple threads. It might work fine as long as you don't access the same files, or you might need to create separate filesystems (which might be prohibitive in terms of resource usage).

Community
vdichev