Best way to detect if a stream is zipped in Java

Question

What is the best way to find out if a java.io.InputStream contains zipped data?

Is this part of an HTTP request/response? – Jim Ferrans, Nov 27 '09 at 14:14

score 47 · Answer 1 · edited May 23 '17 at 11:54

Introduction

Since all the other answers are five years old, I feel obliged to write down how things stand today. I seriously doubt that one should read the magic bytes of the stream! That is low-level code, which should generally be avoided.

Simple answer

miku writes:

If the Stream can be read via ZipInputStream, it should be zipped.

Yes, but in case of ZipInputStream "can be read" means that first call to .getNextEntry() returns a non-null value. No exception catching et cetera. So instead of magic bytes parsing you can just do:

boolean isZipped = new ZipInputStream(yourInputStream).getNextEntry() != null;

And that's it!
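As a minimal sketch of how the one-liner behaves, the class below runs it against an in-memory archive and against plain text. The class name `ZipCheckDemo` and the helper `sampleZip()` are made up for this illustration. Since `getNextEntry()` declares `IOException`, the sketch treats an exception as "not zipped", which is an assumption on my part rather than something the answer prescribes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipCheckDemo {

    // The check from the answer. Note that it consumes bytes from the stream.
    static boolean isZipped(InputStream in) {
        try {
            return new ZipInputStream(in).getNextEntry() != null;
        } catch (IOException e) {
            // Assumption: a stream that cannot be parsed counts as "not zipped".
            return false;
        }
    }

    // Hypothetical helper: builds a one-entry archive in memory.
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("Hello world".getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] text = "just some plain text, not an archive".getBytes(StandardCharsets.UTF_8);
        System.out.println(isZipped(new ByteArrayInputStream(sampleZip()))); // true
        System.out.println(isZipped(new ByteArrayInputStream(text)));        // false
    }
}
```

Keep in mind that the check reads from the stream, so the same stream cannot simply be reused afterwards unless it is rewound.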

General unzipping thoughts

In general, it turns out to be much more convenient to work with files than with streams when [un]zipping. There are several useful libraries, and ZipFile offers more functionality than ZipInputStream. Handling of zip files is discussed here: What is a good Java library to zip/unzip files? So if you can work with files, you'd better do so!
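When the data is already on disk, a stricter check than the stream one-liner is possible, because ZipFile has to locate the archive's central directory before it can open the file. The following is only a sketch under that assumption; the class name `ZipFileCheck` and the two temp-file helpers are invented for the demo.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class ZipFileCheck {

    // Opening succeeds only if ZipFile can find and read the central directory,
    // so non-zip data and unfinished archives are both rejected.
    static boolean isReadableZipFile(File file) {
        try (ZipFile zip = new ZipFile(file)) {
            return true;
        } catch (IOException e) { // ZipException is a subclass of IOException
            return false;
        }
    }

    // Hypothetical helper: writes a one-entry archive to a temp file.
    static File tempZip() {
        try {
            Path path = Files.createTempFile("sample", ".zip");
            try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(path))) {
                zos.putNextEntry(new ZipEntry("hello.txt"));
                zos.write("Hello world".getBytes(StandardCharsets.UTF_8));
            }
            return path.toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes plain text to a temp file with a .zip name.
    static File tempText() {
        try {
            Path path = Files.createTempFile("not-a-zip", ".zip");
            Files.write(path, "just text".getBytes(StandardCharsets.UTF_8));
            return path.toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isReadableZipFile(tempZip()));  // true
        System.out.println(isReadableZipFile(tempText())); // false
    }
}
```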

Code sample

In my application I could only work with streams, so this is the method I wrote for unzipping:

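import org.apache.commons.io.IOUtils;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public boolean unzip(InputStream inputStream, File outputFolder) throws IOException {
    boolean isEmpty = true;
    // try-with-resources closes the zip stream (and the wrapped input) even on failure
    try (ZipInputStream zis = new ZipInputStream(inputStream)) {
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            isEmpty = false;
            File newFile = new File(outputFolder, entry.getName());
            // Reject entry names such as "../x" that would escape outputFolder
            if (!newFile.toPath().normalize().startsWith(outputFolder.toPath().normalize())) {
                throw new IOException("Entry is outside of the target folder: " + entry.getName());
            }
            if (entry.isDirectory()) {
                newFile.mkdirs();
                continue;
            }
            // mkdirs() returns false when the folder already exists,
            // so its result must not decide whether the file gets written
            newFile.getParentFile().mkdirs();
            try (OutputStream fos = new FileOutputStream(newFile)) {
                IOUtils.copy(zis, fos);
            }
        }
    }
    return !isEmpty;
}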

If a `ZipOutputStream` is not finished or closed properly, the resulting file is invalid, and opening it with `new ZipFile(f)` throws an `IOException`. The above will not fail, even when the zip file is invalid for other purposes. – Rudi Kershaw, Dec 02 '16 at 12:13
zis.getNextEntry() advances the InputStream. If the data turns out not to be a zip file, you cannot start over on the InputStream, because zis.getNextEntry() has already consumed part of it. – Luke, Oct 29 '18 at 13:55
@Luke Hm, you're probably right, did you test it? I wrote that quite some time ago, so I'm not sure anymore. – Innokenty, Oct 30 '18 at 14:06
Yes. I found a solution: wrap the inputStream with a BufferedInputStream before passing it to the ZipInputStream, so you can call mark() and reset() on that. https://stackoverflow.com/a/53047891/4265610 – Luke, Oct 30 '18 at 14:58

score 23 · Accepted Answer · answered Nov 27 '09 at 14:20

The magic bytes for the ZIP format are 50 4B. You could test the stream (using mark and reset - you may need to buffer) but I wouldn't expect this to be a 100% reliable approach. There would be no way to distinguish it from a US-ASCII encoded text file that began with the letters PK.

The best way would be to provide metadata on the content format prior to opening the stream and then treat it appropriately.
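The mark/reset test mentioned in this answer could look roughly like the sketch below. The class and method names (`MagicBytes`, `startsWithPk`) are hypothetical, and the demo deliberately feeds it ASCII text that starts with "PK" to show the false positive the answer warns about.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class MagicBytes {

    // Peeks at the first two bytes and rewinds, so the caller can still read
    // the stream from the start. Requires a stream that supports mark/reset.
    static boolean startsWithPk(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        try {
            in.mark(2);
            int first = in.read();
            int second = in.read();
            in.reset();
            return first == 0x50 && second == 0x4B; // 'P', 'K'
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] text = "PKs are not always archives".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(text));
        // Plain text, yet it passes the magic-byte test.
        System.out.println(startsWithPk(in)); // true
    }
}
```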

miku · Answer 3 · 2009-11-27T19:12:33.510

score 6

Not very elegant, but reliable:

If the Stream can be read via ZipInputStream, it should be zipped.

edited Nov 27 '09 at 19:12

answered Nov 27 '09 at 14:14

It just doesn't seem nice. Couldn't it be a corrupted ZIP stream? – Fedearne Nov 27 '09 at 14:20 · score 1

@fedearne: Is a corrupted zip stream a zip stream? – GvS Nov 27 '09 at 14:22 · score 11

I agree: If ZipInputStream can't read it, it doesn't *matter* that it's "meant" to be a Zip file. Right? – Carl Smotricz Nov 27 '09 at 15:23 · score 2

This is the most reliable option. If it's corrupted, how do you know it was a ZIP? You just have to make a guess. – ZZ Coder Nov 27 '09 at 19:11 · score 2

@GvS I have streams that are zipped and streams that are not zipped. I would rather not attempt to parse corrupted zip streams as not zipped, if this could be avoided. – Fedearne Nov 28 '09 at 05:49 · score 1

If you check for 4 magic bytes, 1 out of 4,294,967,296 (2^32) completely random streams will be a false positive. Can you afford that? Are corrupted streams something that will occur more frequently than a non-zipped stream starting with the magic bytes? – GvS Nov 28 '09 at 22:54 · score 1

David Webb · Answer 4 · 2009-11-27T14:44:09.857

score 6

You could check that the first four bytes of the stream are the local file header signature. This signature starts the local file header that precedes every file in a ZIP file, and the spec gives it as 50 4B 03 04.

A little test code shows this to work:

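import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

byte[] buffer = new byte[4];

try {
    // Write a one-entry archive; try-with-resources finishes and closes it
    try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("so.zip"))) {
        ZipEntry ze = new ZipEntry("HelloWorld.txt");
        zos.putNextEntry(ze);
        zos.write("Hello world".getBytes());
    }

    // Read back the first four bytes of the file
    try (FileInputStream is = new FileInputStream("so.zip")) {
        is.read(buffer);
    }
}
catch (IOException e) {
    e.printStackTrace();
}

// %H prints the hex hash code, which for a Byte is its value (no zero padding)
for (byte b : buffer) {
    System.out.printf("%H ", b);
}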

Gave me this output:

50 4B 3 4

edited Nov 27 '09 at 14:44

answered Nov 27 '09 at 14:31

I had the same idea (though trusted Wikipedia over the spec - for shame!), but it seems that this is not a reliable mechanism: _"Implementers should be aware that ZIP files may be encountered with or without this signature marking data descriptors and should account for either case when reading ZIP files to ensure compatibility."_ – McDowell Nov 27 '09 at 14:43 · score 1

That's true from a general perspective, but my guess is that if you don't have the signature, ZipInputStream will fail, as it insists on ZipEntry objects. – David Webb Nov 27 '09 at 14:49 · score 1

You can have random junk prepended to zip files (such as Microsoft Windows executables). Those only work if you use the central directory rather than streaming with local headers. FWIW, the Java PlugIn and WebStart use the central directory but now check the first four bytes as well (see GIARs). – Tom Hawtin - tackline Nov 27 '09 at 17:21 · score 1

Not sure if ZipInputStream will fail on that input. In an intelligent implementation, it will seek forward and *find* that signature. This is the way it's done in self-extracting archives, which on Windows have the PE-COFF signature at the beginning of the file, and the PKZIP zip entry signature within the file, wherever the zip entries are. The file is both an EXE and a ZIP. Will Java's ZipInputStream read this stream? I don't know but it *should*. The ZipInputStream class in other implementations (in DotNetZip for example) can and will read this as a zip stream. – Cheeso Nov 28 '09 at 12:15 · score 1
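The open question in these comments can be tried out with a small experiment. The sketch below prepends a few junk bytes to an in-memory archive, standing in for the executable stub of a self-extracting archive, and asks ZipInputStream for the first entry. The class and helper names are made up for the demo. To my knowledge the OpenJDK implementation expects the local file header signature at the current position and does not scan forward, so the prefixed data is not recognised; other runtimes may differ, so treat the comments as expectations for OpenJDK only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class PrefixedZipDemo {

    // Hypothetical helper: builds a one-entry archive in memory.
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("Hello world".getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Puts arbitrary bytes in front of the archive, like an EXE stub would.
    static byte[] withPrefix(byte[] prefix, byte[] data) {
        byte[] result = new byte[prefix.length + data.length];
        System.arraycopy(prefix, 0, result, 0, prefix.length);
        System.arraycopy(data, 0, result, prefix.length, data.length);
        return result;
    }

    static boolean firstEntryFound(byte[] data) {
        try {
            return new ZipInputStream(new ByteArrayInputStream(data)).getNextEntry() != null;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] zip = sampleZip();
        byte[] junk = "MZ-not-a-real-exe".getBytes(StandardCharsets.US_ASCII);
        System.out.println(firstEntryFound(zip));                   // true
        System.out.println(firstEntryFound(withPrefix(junk, zip))); // false on OpenJDK
    }
}
```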

score 0 · Answer 5 · answered Nov 12 '15 at 05:49

Checking the magic number may not be the right option.

Docx files have the same magic number, 50 4B 03 04.

answered Nov 12 '15 at 05:49

kk nair

That's because docx files are zip files. – tak3shi Mar 03 '16 at 08:56 · score 7

score 0 · Answer 6 · answered Sep 23 '19 at 14:05

Since .zip and .xlsx files share the same magic number, I couldn't tell a genuine zip file from a renamed .xlsx file.

So I used Apache Tika to determine the exact document type.

Even if the file extension has been changed to .zip, Tika still finds the actual type.

Reference: https://www.baeldung.com/apache-tika
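For readers who cannot add Tika, here is a rough stdlib-only approximation of the same idea. It is not what Tika does internally. It relies on the fact that Office Open XML packages (.docx, .xlsx, .pptx) are zip archives containing an entry named `[Content_Types].xml`, and simply looks for that entry. The class and helper names are hypothetical, and a plain zip that happens to contain such an entry would be misclassified.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class OoxmlSniffer {

    // Returns true if the zip data contains the entry that marks an
    // Office Open XML package. Consumes and closes the stream.
    static boolean looksLikeOoxml(InputStream in) {
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if ("[Content_Types].xml".equals(entry.getName())) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            return false; // unreadable data is treated as "not OOXML"
        }
    }

    // Hypothetical helper: builds an archive with the given entry names.
    static byte[] zipWith(String... names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write("placeholder".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] fakeXlsx = zipWith("[Content_Types].xml", "xl/workbook.xml");
        byte[] plainZip = zipWith("hello.txt");
        System.out.println(looksLikeOoxml(new ByteArrayInputStream(fakeXlsx))); // true
        System.out.println(looksLikeOoxml(new ByteArrayInputStream(plainZip))); // false
    }
}
```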

k3b · Answer 7 · 2021-06-19T02:34:25.097

I combined the answers from @McDowell and @Innokenty into a small library function that you can paste into your project:

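import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.ZipInputStream;

public static boolean isZipStream(InputStream inputStream) {
    if (inputStream == null || !inputStream.markSupported()) {
        throw new IllegalArgumentException("InputStream must support mark-reset. Use BufferedInputStream");
    }
    boolean isZipped = false;
    inputStream.mark(2048);
    try {
        // Do not close this ZipInputStream: that would also close inputStream.
        isZipped = new ZipInputStream(inputStream).getNextEntry() != null;
    } catch (IOException ex) {
        // Cannot be opened as zip.
    } finally {
        try {
            inputStream.reset(); // rewind even if the probe threw
        } catch (IOException ex) {
            throw new UncheckedIOException("Could not rewind the stream", ex);
        }
    }
    return isZipped;
}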

You can use the lib like this:

public static void main(String[] args) {
    InputStream inputStream = new BufferedInputStream(...);

    if (isZipStream(inputStream)) {
        // do zip processing using inputStream
    } else {
        // do non-zip processing using inputStream
    }

}
