59

Is there any way to check if InputStream has been gzipped? Here's the code:

public static InputStream decompressStream(InputStream input) {
    try {
        GZIPInputStream gs = new GZIPInputStream(input);
        return gs;
    } catch (IOException e) {
        logger.info("Input stream not in the GZIP format, using standard format");
        return input;
    }
}

I tried this way but it doesn't work as expected - values read from the stream are invalid. EDIT: Added the method I use to compress data:

public static byte[] compress(byte[] content) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try {
        GZIPOutputStream gs = new GZIPOutputStream(baos);
        gs.write(content);
        gs.close();
    } catch (IOException e) {
        logger.error("Fatal error occurred while compressing data");
        throw new RuntimeException(e);
    }
    double ratio = (1.0f * content.length / baos.size());
    if (ratio > 1) {
        logger.info("Compression ratio equals " + ratio);
        return baos.toByteArray();
    }
    logger.info("Compression not needed");
    return content;

}
voo
  • Where does the `InputStream` come from? From `URLConnection#getInputStream()`? In any reasonably designed protocol such as HTTP, the end user should already be told somehow that the content is gzipped. – BalusC Jan 27 '11 at 15:47
  • Given that GZIP has a 32 bit CRC, I find that surprising. A corrupt stream should throw an exception at the end at least. – Peter Lawrey Jan 27 '11 at 15:47
  • I'm wondering if the OP means that values read from the stream AFTER the IOException occurs are not valid... which would make sense because the GZIPInputStream constructor would have consumed some of the bytes from the stream. – Eric Giguere Jan 27 '11 at 15:50
  • Values are corrupted after the IOException occurred. The InputStream comes from HttpURLConnection#getInputStream() – voo Jan 27 '11 at 15:53
  • Right, that's because the GZIPInputStream reads bytes from the original input stream. So you need to buffer the input stream as shown in the answer below. – Eric Giguere Jan 27 '11 at 15:58
  • 1
    So the general solution is to create a BufferedInputStream wrapping the original input stream, then call "mark" to mark the beginning of the stream. Then wrap a GZIPInputStream around that. If no exception occurs, return the GZIPInputStream. If an exception occurs, call "reset" and return the BufferedInputStream. – Eric Giguere Jan 27 '11 at 16:06
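
The mark/reset approach described in the last comment could be sketched roughly as follows. The class name MarkResetGzip and the READ_LIMIT value are assumptions made for illustration. The sketch also assumes that a failed header check consumes fewer than READ_LIMIT bytes, otherwise reset() is not guaranteed to work.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class MarkResetGzip {

    // Hypothetical limit: how many bytes may be read after mark()
    // before reset() is no longer guaranteed to succeed.
    private static final int READ_LIMIT = 1024;

    public static InputStream decompressStream(InputStream input) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(input);
        buffered.mark(READ_LIMIT); // remember the start of the stream
        try {
            // The constructor reads and validates the GZIP header.
            return new GZIPInputStream(buffered);
        } catch (IOException e) {
            // ZipException (wrong magic number) or EOFException (stream too
            // short for a header): rewind and return the bytes unchanged.
            buffered.reset();
            return buffered;
        }
    }
}
```

Unlike the code in the question, the caller gets back the buffered wrapper rather than the original stream, so the bytes consumed by the header check are not lost.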

10 Answers

78

It's not foolproof (uncompressed data can happen to start with the same two bytes), but it's probably the easiest approach and doesn't rely on any external data. Like most well-designed formats, GZIP begins with a magic number, which can be checked quickly without reading the entire stream.

public static InputStream decompressStream(InputStream input) throws IOException {
     PushbackInputStream pb = new PushbackInputStream( input, 2 ); //we need a pushback stream to look ahead
     byte [] signature = new byte[2];
     int len = pb.read( signature ); //read the signature; -1 if the stream is empty, 1 if only one byte was read
     if( len > 0 )
       pb.unread( signature, 0, len ); //push back only the bytes that were actually read
     if( len == 2 && signature[ 0 ] == (byte) 0x1f && signature[ 1 ] == (byte) 0x8b ) //check if it matches the standard gzip magic number
       return new GZIPInputStream( pb );
     else
       return pb;
}

(Source for the magic number: GZip file format specification)

Update: I've just discovered that there is also a constant called GZIP_MAGIC in GZIPInputStream which contains this value, so if you really want to, you can use the lower two bytes of it.

biziclop
  • 2
    I believe you need to use the 2-arg constructor for PushBackInputStream, since by default it only allows you to push back 1 byte (and pb.unread(signature) pushes back 2 bytes). e.g. `new PushBackInputStream(input, 2)` – overthink Aug 02 '11 at 19:07
  • @overthink You're absolutely right, Sir. Well spotted and thank you. – biziclop Aug 03 '11 at 14:15
  • No prob. Useful answer, btw! – overthink Aug 03 '11 at 16:21
  • 4
    Good approach, but there is a bug when the stream is empty or has only one byte. You need to check the number of bytes read, and write back only those read. The signature check should then only be done if both bytes were read successfully. – Alexander Torstling Apr 29 '13 at 08:52
  • It's "PushbackInputStream" in case anyone copies and pastes the code. – Anoyz Sep 11 '13 at 11:24
  • 1
    Therefore it should be `int nread = pb.read( signature ); if (nread > 0) pb.unread( signature, 0, nread );` – 18446744073709551615 May 14 '15 at 08:11
  • 1
    Is there a way to reset the original stream after the two bytes are read? I need to handle the original stream, not a new GZIPInputStream since it seems like creating a new GZIPInputStream object creates a new stream thats 10kb bigger – McLovin Jul 14 '16 at 18:12
  • 1
    @McLovin You can't reset the original stream (unless it supports mark/reset operations, which isn't guaranteed), all you can reset is the pushbackinputstream you wrap the original stream in. – biziclop Aug 24 '16 at 21:58
  • 1
    `new GZIPInputStream` consumes bytes so if there's some other `ZipException` problem you've corrupted the stream anyway. – amos Sep 16 '16 at 18:36
  • @amos Yes, that's the nature of every stream in general. (Apart from specially designed protocols, like streaming video, where you don't mind if a couple of second's worth of data is lost, so long as you eventually recover.) – biziclop Sep 16 '16 at 19:46
  • 1
    With GZIP_MAGIC and Guava: if (len == 2 && GZIPInputStream.GZIP_MAGIC == Ints.fromBytes((byte) 0, (byte) 0, signature[1], signature[0])) – blacelle Jun 13 '17 at 13:37
40

The InputStream comes from HttpURLConnection#getInputStream()

In that case you need to check whether the HTTP Content-Encoding response header equals gzip.

URLConnection connection = url.openConnection();
InputStream input = connection.getInputStream();

if ("gzip".equals(connection.getContentEncoding())) {
    input = new GZIPInputStream(input);
}

// ...

This is all clearly specified in the HTTP spec.


Update: regarding the way you compress the source of the stream: this ratio check is pretty... insane. Get rid of it. The same length does not necessarily mean that the bytes are the same. Let it always return the gzipped stream so that you can always expect a gzipped stream and just apply GZIPInputStream without nasty checks.
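
A minimal sketch of that suggestion, assuming both ends of the connection can be changed: the sender always compresses, so the receiver can always decompress without guessing. The class name AlwaysGzip and the decompress helper are illustrative additions that do not appear in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class AlwaysGzip {

    // Always compress, even when the result is larger than the input,
    // so the receiving side never has to detect the format.
    public static byte[] compress(byte[] content) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            gz.write(content);
        } // closing the stream writes the GZIP trailer
        return baos.toByteArray();
    }

    // Counterpart for the receiving side: the input is assumed to be GZIP data.
    public static byte[] decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        }
    }
}
```

The trade-off is a fixed overhead on very small payloads, in exchange for removing the format-detection step entirely.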

BalusC
  • Unfortunately, this is not exactly what I need since I use http to exchange binary data in the client-server architecture and as a result Content-Encoding is not set. Additionally, I won't be able to call getContentEncoding when the request comes from the client who's served by the servlet. But still thank you for the answer. – voo Jan 27 '11 at 16:15
  • 1
    Then the other side is in essence abusing the HTTP protocol, or it is not an HTTP service at all. Ask the service admin how to determine whether the response is gzipped or not. Edit: wait, do you mean that there's a servlet which is proxying the request and that your input is coming from its response? Then that servlet needs to be fixed so that it copies all mandatory HTTP headers as well. – BalusC Jan 27 '11 at 16:16
  • 1
    Last time I checked you were allowed to transport any kind of content over HTTP, gzip included, so it's not really an abuse. – biziclop Jan 27 '11 at 16:42
  • 1
    @biziclop: that abuse was not about using gzip content encoding (heck, I even included the HTTP spec link about this in my initial answer), but about not sending the mandatory HTTP headers along it (which thus means that OP is violating the HTTP spec). – BalusC Jan 27 '11 at 16:44
  • As for the ratio check, I am not sure you're right. For instance, compressing 32 bytes of data results in 56 bytes being sent to the client, and this made me wonder and search for a solution. – voo Jan 27 '11 at 16:46
  • 1
    Sounds like you're attempting to compress binary content instead of textual content. Is this true? Why would you ever attempt to compress binary content? In normal HTTP servers/clients, gzip is generally only applied on `Content-Type` starting with `text/` like `text/plain`, `text/html`, `text/css`, etc. – BalusC Jan 27 '11 at 16:48
  • 1
    @BalusC "When present, its value indicates what additional content codings have been applied to the entity-body, and thus what decoding mechanisms must be applied in order to obtain the media-type referenced by the Content-Type header field" Which clearly means that if I want to transmit gzipped content, I shouldn't (indeed I mustn't) set the content-encoding field. Just to make it clear: not some content transport-coded in gzip but a file which happens to be gzip format. – biziclop Jan 27 '11 at 16:53
  • @BalusC *Sigh* The first time you want to transmit a gzip file, you'll understand what I meant. – biziclop Jan 27 '11 at 17:01
27

I found this useful example that provides a clean implementation of isCompressed():

/**
 * Determines whether a byte array is GZIP-compressed by checking its first
 * two bytes against the GZIP magic number. The java.util.zip GZIP
 * implementation does not expose the GZIP header, so the check is done by hand.
 *
 * @param bytes an array of bytes
 * @return true if the array starts with the GZIP magic number, false otherwise
 */
public boolean isCompressed(byte[] bytes) {
    if (bytes == null || bytes.length < 2) {
        return false;
    }
    // GZIP_MAGIC is 0x8b1f; the header stores it low byte first.
    return bytes[0] == (byte) GZIPInputStream.GZIP_MAGIC
            && bytes[1] == (byte) (GZIPInputStream.GZIP_MAGIC >> 8);
}

I tested it with success:

@Test
public void testIsCompressed() {
    assertFalse(util.isCompressed(originalBytes));
    assertTrue(util.isCompressed(compressed));
}
crusy
Aaron Roller
11

I believe this is the simplest way to check whether a byte array is gzip formatted or not. It does not depend on any HTTP entity or MIME type support.

public static boolean isGzipStream(byte[] bytes) {
      // Guard against arrays too short to hold the two-byte magic number.
      if (bytes == null || bytes.length < 2) {
          return false;
      }
      int head = ((int) bytes[0] & 0xff) | ((bytes[1] << 8) & 0xff00);
      return (GZIPInputStream.GZIP_MAGIC == head);
}
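
The comparison works because GZIPInputStream.GZIP_MAGIC is the int 0x8b1f, which is the two header bytes 0x1f and 0x8b combined in little-endian order. A small sketch to illustrate this (the class GzipMagicDemo and its helper methods are hypothetical names introduced here):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipMagicDemo {

    // Combine the first two bytes low byte first, masking off sign extension.
    // The caller must pass at least two bytes.
    public static int head(byte[] bytes) {
        return (bytes[0] & 0xff) | ((bytes[1] & 0xff) << 8);
    }

    // Produce real GZIP data to test the header against.
    public static byte[] gzip(byte[] content) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            gz.write(content);
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("example".getBytes());
        System.out.println(head(compressed) == GZIPInputStream.GZIP_MAGIC);
    }
}
```

The masks matter because Java bytes are signed: without them, 0x8b would be sign-extended to a negative int before the shift.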
Community
Oconnell
5

Building on the answer by @biziclop - this version uses the GZIP_MAGIC header and additionally is safe for empty or single byte data streams.

public static InputStream maybeDecompress(InputStream input) throws IOException {
    final PushbackInputStream pb = new PushbackInputStream(input, 2);

    int header = pb.read();
    if(header == -1) {
        return pb;
    }

    int b = pb.read();
    if(b == -1) {
        pb.unread(header);
        return pb;
    }

    pb.unread(new byte[]{(byte)header, (byte)b});

    header = (b << 8) | header;

    if(header == GZIPInputStream.GZIP_MAGIC) {
        return new GZIPInputStream(pb);
    } else {
        return pb;
    }
}
blue
4

This function works perfectly well in Java:

public static boolean isGZipped(File f) throws IOException {
    // try-with-resources closes the file even if reading fails
    try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
        int magic = (raf.read() & 0xff) | ((raf.read() << 8) & 0xff00);
        return magic == GZIPInputStream.GZIP_MAGIC;
    }
}

In Scala:

def isGZip(file: File): Boolean = {
   val raf = new RandomAccessFile(file, "r")
   try {
      val magic = (raf.read() & 0xff) | ((raf.read() << 8) & 0xff00)
      magic == GZIPInputStream.GZIP_MAGIC
   } finally raf.close()
}
ypriverol
1

SimpleMagic is a Java library for resolving content types:

<!-- pom.xml -->
    <dependency>
        <groupId>com.j256.simplemagic</groupId>
        <artifactId>simplemagic</artifactId>
        <version>1.8</version>
    </dependency>

import com.j256.simplemagic.ContentInfo;
import com.j256.simplemagic.ContentInfoUtil;
import com.j256.simplemagic.ContentType;
// ...

public class SimpleMagicSmokeTest {

    private final static Logger log = LoggerFactory.getLogger(SimpleMagicSmokeTest.class);

    @Test
    public void smokeTestSimpleMagic() throws IOException {
        ContentInfoUtil util = new ContentInfoUtil();
        InputStream possibleGzipInputStream = getGzipInputStream();
        ContentInfo info = util.findMatch(possibleGzipInputStream);

        log.info( info.toString() );
        assertEquals( ContentType.GZIP, info.getContentType() );
    }
}
Abdull
1

Wrap the original stream in a BufferedInputStream, then wrap that in a GZIPInputStream. Next try to extract a ZipEntry. If this works, it's a zip file. Then you can use "mark" and "reset" in the BufferedInputStream to return to the initial position in the stream, after your check.

Amir Afghani
  • Well, GZip != Zip so the idea is right, but you want to wrap the GZipInputStream, not a ZipInputStream. – Eric Giguere Jan 27 '11 at 16:03
  • True that, I'll fix the answer. – Amir Afghani Jan 27 '11 at 16:13
  • And if the size of the entry overflows the buffer size? – Lawrence Dol Jan 27 '11 at 16:36
  • There's no such thing as a ZipEntry for a GZIPInputStream. GZ streams only contain one file (at least, through the Java API). – GreenGiant May 07 '13 at 18:55
  • I tried something like this, but couldn't get it to work. I'm reading protobufs out of the GZipInputStream, so I'm not sure if it's the protobuf reading code or the GZip code, but the mark was reset afterwards, so I couldn't set the stream back to the beginning. – kybernetikos Aug 04 '15 at 13:21
1

Not exactly what you are asking but could be an alternative approach if you are using HttpClient:

private static InputStream getInputStream(HttpEntity entity) throws IOException {
  Header encoding = entity.getContentEncoding(); 
  if (encoding != null) {
     if (encoding.getValue().equals("gzip") || encoding.getValue().equals("zip")
           || encoding.getValue().equals("application/x-gzip-compressed")) {
        return new GZIPInputStream(entity.getContent());
     }
  }
  return entity.getContent();
}
Richard H
0

This is how to read a file that CAN BE gzipped:

private void read(final File file)
        throws IOException {
    InputStream stream = null;
    try (final InputStream inputStream = new FileInputStream(file);
            final BufferedInputStream bInputStream = new BufferedInputStream(inputStream);) {
        bInputStream.mark(1024);
        try {
            stream = new GZIPInputStream(bInputStream);
        } catch (final ZipException | EOFException e) {
            // not gzipped, unsupported format, or too short to hold a GZIP header
            bInputStream.reset();
            stream = bInputStream;
        }
        // USE STREAM HERE
    } finally {
        if (stream != null) {
            stream.close();
        }
    }
}
TekTimmy