4

I was presented with a situation where a file in a proprietary format was compressed to a .gz, then renamed back to its original extension and compressed again. I would like to catch this scenario, and I wonder whether there is a way to detect when a file has been compressed twice.

I am reading the .gz files as follows:

 GZIPInputStream gzip = new GZIPInputStream(Files.newInputStream(inFile));
 BufferedReader breader = new BufferedReader(new InputStreamReader(gzip)); 
nhouser9
panza
  • Maybe this will help: http://stackoverflow.com/a/30328554/5244131 – Rabbit Guy Aug 09 '16 at 19:29
  • I attended an interesting talk a while ago, about detecting when images had been photoshopped (for court cases and the like). One thing that came up was that the compression lost color information in such a way that if a file had been compressed, edited and compressed again, there would be statistically visible "holes" in the spread of color information resulting from the overlap of multiple compression methods. I don't know if that helps, at all, but there may be similar statistical artifacts that come up for file compression. – Edward Peters Aug 09 '16 at 19:34

2 Answers

2

You can check for a valid gzip header within the file. A gzip file starts with a defined header whose first two bytes are 0x1f and 0x8b (see the spec, RFC 1952). To catch a doubly compressed file, apply the check to the data you get after decompressing once. You can read these bytes and test whether they match the header values:

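import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

// try-with-resources closes the stream even if read() throws
try (InputStream is = new FileInputStream(new File(filePath))) {
    byte[] b = new byte[2];
    int n = is.read(b);   // note: read(byte[]) may return fewer bytes than requested
    if (n != 2) {
        // fewer than two bytes were read: treat as not a gzip file
    } else if (b[0] == (byte) 0x1f && b[1] == (byte) 0x8b) {
        // 2-byte gzip header found
    }
}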

These two bytes alone have a 1 in 65,536 chance of occurring at random, but depending on the data you expect to receive, that can be enough to base your decision on. To be more confident, you can read further into the header and check that it follows valid spec values (see the link above; for example, the third byte is the compression method, which is typically but not always 8 for DEFLATE, and so on).
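A slightly more defensive variant of the check above, as a minimal sketch. It reads the header one byte at a time, so a short read cannot produce a false negative, and it also tests the third byte against 8 (DEFLATE). The class name `GzipHeaderCheck` and the method `looksLikeGzip` are hypothetical and only serve the illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the names are illustrative only.
final class GzipHeaderCheck {
    private static final int ID1 = 0x1f;      // first magic byte
    private static final int ID2 = 0x8b;      // second magic byte
    private static final int CM_DEFLATE = 8;  // compression method: DEFLATE

    /** Returns true if the stream starts with the gzip magic bytes followed by CM = 8. */
    static boolean looksLikeGzip(InputStream in) throws IOException {
        // read() returns one byte as 0-255, or -1 at end of stream,
        // so three separate calls cannot be cut short the way read(byte[]) can.
        int b0 = in.read();
        int b1 = in.read();
        int b2 = in.read();
        return b0 == ID1 && b1 == ID2 && b2 == CM_DEFLATE;
    }

    /** Convenience overload that opens and closes the file itself. */
    static boolean looksLikeGzip(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeGzip(in);
        }
    }
}
```

To test for double compression, the same method could be called on a `GZIPInputStream` wrapped around the file, since that stream yields the data after one round of decompression. Note that the stream overload consumes the three bytes it inspects.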

copeg
  • `if ( n != 2 )` isn't strictly correct. `read(byte[])` attempts to read *at least* one byte, but there isn't a guarantee it will fill the buffer until EOF. Your best bet here is two `read()` calls. – David Ehrmann Aug 09 '16 at 19:53
  • @copeg, thank you, this helps a lot - accepted answer – panza Aug 21 '16 at 08:52
1

A brute-force way would be: uncompress the file, and if that works, try to uncompress the result again. If that works too, you know that it was compressed at least twice. In the worst case, though, the result could still be compressed.

And actually, I don't know of other ways to figure that out.

You see, in the end, compression is about changing the bytes of your file. So even when the second compression doesn't do much to the content of the file, it still changes some bytes. So, just from looking at those bytes, you won't see what is going on.
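The brute-force idea could be sketched roughly like this. The sketch peels off one gzip layer at a time and counts how many it finds. As an assumption on my part, it peeks at the two magic bytes (0x1f, 0x8b) before each attempt instead of blindly trying to decompress, so it borrows the header check rather than being a pure trial-and-error loop. `GzipLayers`, `countLayers` and `maxLayers` are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical helper; class and method names are illustrative only.
final class GzipLayers {
    /** Peeks at the first two bytes without consuming them. */
    private static boolean startsWithGzipMagic(BufferedInputStream in) throws IOException {
        in.mark(2);
        int b0 = in.read();
        int b1 = in.read();
        in.reset();
        return b0 == 0x1f && b1 == 0x8b;
    }

    /**
     * Counts nested gzip layers, stopping at maxLayers.
     * Closes the given stream before returning.
     */
    static int countLayers(InputStream raw, int maxLayers) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        try {
            int layers = 0;
            while (layers < maxLayers && startsWithGzipMagic(in)) {
                // The GZIPInputStream constructor reads and validates the rest of
                // the header; a malformed header surfaces as an IOException.
                in = new BufferedInputStream(new GZIPInputStream(in));
                layers++;
            }
            return layers;
        } finally {
            in.close(); // closes the whole chain, including raw
        }
    }

    static int countLayers(Path file, int maxLayers) throws IOException {
        return countLayers(Files.newInputStream(file), maxLayers);
    }
}
```

A result of 2 or more from `countLayers(inFile, 4)` would indicate that the file was compressed at least twice. The sketch only inspects the start of each layer; it does not verify that every layer decompresses cleanly to the end, and data that merely happens to begin with the magic bytes would cause an exception rather than a count.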

GhostCat