I once downloaded a web page using curl, and the resulting file contains the HTML in compressed form. I would like to decompress it.

I tried this Python code:

import gzip

# Open the file as a gzip stream and read the decompressed bytes
with gzip.open(file_name, 'rb') as f:
    file_content = f.read()

which results in the following error: gzip.BadGzipFile: Not a gzipped file (b'\x1f\xc2').

\x1f and \xc2 are the first two bytes of the file. That is confirmed by:

with open(file_name, "rb") as f:
    binary_file_content = f.read()
# Print the first 12 bytes as decimal values
for i in range(12):
    print(binary_file_content[i], end=" ")

which prints the first 12 bytes of the file: 31 194 139 8 0 0 0 0 0 0 3 195 (where 31 and 194 are the decimal values of the 1F and C2 seen above).

Do the first bytes provide a hint as to which decompression method should be used? (I made a few attempts with zlib.decompress, but they have failed so far.)
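
One hypothesis worth testing, offered as an assumption and not a confirmed diagnosis: a gzip stream begins with the bytes 1f 8b 08, and c2 8b is exactly how UTF-8 encodes the single character U+008B. The printed bytes 31 194 139 8 would therefore match a gzip file whose bytes were decoded as Latin-1 and written back as UTF-8 somewhere along the way. A minimal sketch that reverses such a round trip (the helper name `undo_utf8_reencoding` is made up for illustration):

```python
import gzip


def undo_utf8_reencoding(data: bytes) -> bytes:
    """Reverse a Latin-1 -> UTF-8 re-encoding, then gunzip the result.

    Assumption: `data` was originally a gzip stream whose bytes were
    decoded as Latin-1 and re-encoded as UTF-8.
    """
    # Each original byte became one code point U+0000..U+00FF,
    # so encoding back to Latin-1 restores the original bytes.
    original = data.decode("utf-8").encode("latin-1")
    return gzip.decompress(original)


# Self-contained demonstration with synthetic data (not the real file)
payload = b"<html><body>hello</body></html>"
damaged = gzip.compress(payload).decode("latin-1").encode("utf-8")
print(damaged[:4].hex(" "))  # 1f c2 8b 08
assert undo_utf8_reencoding(damaged) == payload
```

If the file was not damaged this way, the decode or encode step raises a UnicodeError, or gzip.decompress raises gzip.BadGzipFile, so a failure here simply rules the hypothesis out.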

Edit: The output of file myCompressedFile is data.

Georg
  • It's definitely *possible*; for instance, 7-Zip can attempt to unzip any file, regardless of its extension. Since it's open source you could theoretically see how it works and not have to re-invent the wheel. – Random Davis Feb 03 '22 at 23:05
  • 1
    Try https://en.m.wikipedia.org/wiki/File_(command) – Kelly Bundy Feb 03 '22 at 23:23
  • The file command doesn't seem to know what that is. It seems likely that it's a compressed file format, since there are a few that start with `1f`, such as compress and gzip. But I've never seen `1f c2`. – Mark Adler Feb 04 '22 at 00:49
  • 1
    Since there is no standard that says all compressed file formats must start with some unique signature / sequence of bytes, I'd say no, it's not possible. The best you could do is check for one or more that do so they're handled appropriately. – martineau Feb 04 '22 at 01:03
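
The suggestion in the last comment, checking for the signatures that do exist, could look roughly like the sketch below. The table covers only a handful of well-known magic numbers, and the function name `guess_compression` is hypothetical. A missing match means "unknown", not "uncompressed".

```python
import bz2
import gzip
import lzma
from typing import Optional

# (leading bytes, format name) for a few formats with a fixed signature
SIGNATURES = [
    (b"\x1f\x8b", "gzip"),
    (b"\x1f\x9d", "compress (LZW)"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"PK\x03\x04", "zip"),
]


def guess_compression(data: bytes) -> Optional[str]:
    """Return the name of the first matching signature, or None."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


print(guess_compression(gzip.compress(b"x")))   # gzip
print(guess_compression(bz2.compress(b"x")))    # bzip2
print(guess_compression(lzma.compress(b"x")))   # xz
# The bytes from the question match none of the entries above
print(guess_compression(bytes([31, 194, 139, 8])))  # None
```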