4

When working with the gzip library in Python, I often come across code that uses the .read() function in a pattern that looks like this:

with gzip.open(filename) as bytestream:
    bytestream.read(16) 
    buf = bytestream.read(
        IMAGE_SIZE * IMAGE_SIZE * num_images * NUM_CHANNELS
    )
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)

While I'm familiar with the context manager pattern, I struggle to grasp what the first line inside the with block is actually doing.

This is the documentation for the read() function:

Read at most n characters from stream.

Read from underlying buffer until we have n characters or we hit EOF. If n is negative or omitted, read until EOF.

If that is the case, the role of the first line, bytestream.read(16), would be to read and thus skip the first 16 characters, presumably because they act as metadata or a header. However, given some images of my own, how would I know to use 16 as the argument for the read call instead of, say, 8, 32, or 64?

I recall many times coming across code identical to the above except that the author used bytestream.read(8) instead of bytestream.read(16), or some other value. Digging into the file character by character shows no discernible pattern that would determine the length of the header.

In other words, how does one determine the parameter to be used in the read function call? Or how does one know the length of the header in a gzip-compressed file?

My guess was that it has something to do with the bytes, but after searching through the documentation and online references I can't confirm that.

Reproducible details

My hypothesis, after countless hours of troubleshooting, is that the first 16 characters represent some sort of header or metadata. So the first line in that code skips those 16 characters, and the next line stores the remainder in a variable named buf. However, digging into the data I found no way to determine why or how the value 16 was chosen. I have read the bytes in character by character, and also tried reading and casting them as np.float, but there is no discernible pattern suggesting that the metadata ends at the 16th character and the actual data begins on the 17th.

The following code reads the data from this website and extracts the first 30 characters. Notice that it is not obvious where the header "ends" (apparently at the 16th character, after the second appearance of \x1c) and where the data begins:

import gzip
import numpy as np

train_data_filename = 'data_input/train-images-idx3-ubyte.gz'
IMAGE_SIZE = 28
NUM_CHANNELS = 1

def extract_data(filename, num_images):
    with gzip.open(filename) as bytestream:
        first30 = bytestream.read(30)
        return first30

first30 = extract_data(train_data_filename, 10)
print(first30)
# returns: b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

If we modify the code to cast the bytes as np.float32, so that every character becomes a numeric (float) value, there is still no apparent pattern that distinguishes where the header / metadata ends and where the data begins.

Any reference or advice would be very appreciated!

onlyphantom
  • It just specifies the buffer size. See this: https://stackoverflow.com/questions/1035340/reading-binary-file-and-looping-over-each-byte – I_Al-thamary Jan 03 '19 at 10:38
  • This is very specific to the individual data format. The folks writing the code presumably knew enough about what they were parsing to make the assumption at hand. Your question *doesn't specify anything whatsoever* about the data format the code is intended to parse, whereas the code's needs are entirely driven by that format's specification, so... how is this expected to be answerable? – Charles Duffy Jan 03 '19 at 14:39
  • Only a very small subset of formats (like JSON, or s-expressions, or msgpack) are schema-carrying -- in a large majority of cases, details of which fields exist at which offsets &c. are out-of-band. If you're lucky, there's a formal specification document, or something like a protobuf spec that can be used to automatically generate machine parsers... but whether those things exist isn't something where we could even tell you where to start unless you find documentation from the author of the data describing the format it's in. – Charles Duffy Jan 03 '19 at 14:44
  • I understand the value 8 is chosen in a way specific to that problem _at hand_. But the question arises out of an observably general pattern, so I was really hoping to understand on a more general level, how, given a gzip-compressed file, can an end user know how to read from the file while escaping the first n characters of headers / meta-data. Without the a priori knowledge that the author of the file has – onlyphantom Jan 03 '19 at 14:48
  • Why do you expect there to be any difference between deciding how large the header is in a stream that's being decompressed from gzip'd input and deciding how large the header is in any other format? This isn't a gzip header, it's a header *within the content itself*, and it's subject to that content's file format. Using `gzip.open(...).read()` will only ever return contents that are (from gzip's perspective) data -- there's no gzip-specific metadata returned at all. – Charles Duffy Jan 03 '19 at 14:49
  • ...as for the larger question of "how do I parse a file without knowing its syntax?", unless you're talking about writing analysis tools that generalize patterns from a series of files in a corpus (something that *has* been done, but it's more a place to be searching research papers than asking on SO -- the results are typically not reliable enough to be used in production software, and thus more of interest to reverse engineers and academics than people trying to build robust software from scratch), you don't. – Charles Duffy Jan 03 '19 at 14:53
  • ...I mean, people wouldn't engineer schema-carrying formats *in the first place* if there were some generic solution that didn't require them. – Charles Duffy Jan 03 '19 at 14:55

2 Answers

2

From gzip's perspective, everything it's returning to you is data. There is no metadata or gzip-specific header contents prepended to that data stream, so there's no need for any kind of algorithm to figure out how much content gzip is prepending to that stream: The number of bytes it prepends is zero.
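As a rough illustration of that point, the sketch below compresses a made-up payload in memory and reads it back through gzip.open(). The payload contents and the 12-byte "header" are invented for the example; nothing here comes from the MNIST files.

```python
import gzip
import io

# Hypothetical payload: a 12-byte application-level "header" followed by data.
payload = b"HEADER-BYTES" + b"pixel data"

# gzip adds its own framing to the *compressed* bytes only.
compressed = gzip.compress(payload)

# gzip.open() accepts a file object as well as a filename and defaults to
# binary mode, so read(n) returns at most n decompressed bytes.
with gzip.open(io.BytesIO(compressed)) as bytestream:
    first = bytestream.read(12)   # the application's own header
    rest = bytestream.read()      # everything after it

roundtrip_ok = (first + rest == payload)
print(first)         # b'HEADER-BYTES'
print(roundtrip_ok)  # True
```

Any header you see in the decompressed stream was put there by whoever wrote the original file, not by gzip.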


Scroll down to the bottom of the page you linked; there's a header titled FILE FORMATS FOR THE MNIST DATABASE.

That format specification tells you exactly what the format is, and thus how many bytes are used for each header. Specifically, the first four items in each file are described as follows:

0000     32 bit integer  0x00000803(2051) magic number 
0004     32 bit integer  60000            number of images 
0008     32 bit integer  28               number of rows 
0012     32 bit integer  28               number of columns 

Thus, if you want to skip all four of those items, you would take 16 bytes off the top.
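Instead of discarding those 16 bytes, you could decode them. The following is a minimal sketch, assuming the four fields are big-endian (which matches the bytes printed in the question, where \x00\x00\xea` is 60000) and treating them as unsigned. The function name read_idx3_header_and_pixels and the tiny synthetic file are made up for illustration.

```python
import gzip
import io
import struct

def read_idx3_header_and_pixels(file_or_path):
    """Return ((magic, num_images, rows, cols), raw pixel bytes)."""
    with gzip.open(file_or_path) as bytestream:
        header = bytestream.read(16)
        # '>' = big-endian; 'I' = 32-bit unsigned integer (signedness assumed).
        magic, num_images, rows, cols = struct.unpack(">IIII", header)
        if magic != 2051:
            raise ValueError(f"unexpected magic number: {magic}")
        # The header itself says how many pixel bytes follow.
        pixels = bytestream.read(num_images * rows * cols)
    return (magic, num_images, rows, cols), pixels

# Synthetic stand-in for the real file: 2 images of 3x3 pixels.
fake = struct.pack(">IIII", 2051, 2, 3, 3) + bytes(range(18))
fields, pixels = read_idx3_header_and_pixels(io.BytesIO(gzip.compress(fake)))
print(fields)       # (2051, 2, 3, 3)
print(len(pixels))  # 18
```

The returned bytes can then be handed to np.frombuffer(..., dtype=np.uint8) as in the question, with the image count and dimensions taken from the header rather than hard-coded.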

Charles Duffy
  • thanks a lot, this is going a long way to clearing some of that. Can you break it down to how you calculated "16 bytes off the top"? Really appreciate it. – onlyphantom Jan 03 '19 at 14:54
  • 1
    The items listed are 4 32-bit integers; 8 bits to a byte; so (4*32/8) == 16. – Charles Duffy Jan 03 '19 at 14:56
  • That's perfect. Thank you. – onlyphantom Jan 03 '19 at 14:57
  • 1
    The file format looks a bit more complicated than that. The magic number encodes the data type and number of dimensions of the data. The third byte is `0x08`, which the page lists as meaning the data is in unsigned bytes. The fourth byte is `0x03`, which means there will be three subsequent dimensions, each a 32-bit big-endian integer (I presume unsigned, but it doesn't say). So in total there will be a 16-byte header. It would seem the code snippet has basically ignored the general file format and focused on a specific subset of the file format. – Dunes Jan 03 '19 at 15:09
  • @Dunes so if I understand this correctly, there’s the first four bytes up to how you explained it in the first two sentences. Then 12 extra bytes because we take it that 0x03 means three subsequent dimensions, each 4 bytes for a total of 12. 12+4 (first 4) = 16. Is that correct? – onlyphantom Jan 03 '19 at 15:30
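The arithmetic discussed in these comments can be sketched as a small helper that derives the header length from the four magic bytes. This assumes the layout described above (two zero bytes, a data-type byte, a dimension-count byte, then one 32-bit size per dimension); idx_header_length is a hypothetical name, not part of any library.

```python
import struct

def idx_header_length(magic_bytes: bytes) -> int:
    """Header length in bytes for an IDX-style file, given its first 4 bytes."""
    zero_a, zero_b, dtype_code, ndim = struct.unpack(">BBBB", magic_bytes)
    if zero_a != 0 or zero_b != 0:
        raise ValueError("first two bytes of the magic number should be zero")
    # 4 bytes of magic number plus one 4-byte size field per dimension.
    return 4 + 4 * ndim

print(idx_header_length(b"\x00\x00\x08\x03"))  # 16 (three dimensions)
print(idx_header_length(b"\x00\x00\x08\x01"))  # 8  (one dimension)
```

A file with a single dimension would give 8 under this layout, which may be where the read(8) variants mentioned in the question come from.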
0

From the code snippet, bytestream.read(16) reads and discards the first 16 bytes of bytestream. The documentation you quoted speaks of reading at most n characters, but gzip.open() opens the file in binary mode by default, so here read(16) returns at most 16 bytes.

See more on chars and bytes https://pymotw.com/3/gzip/#reading-compressed-data

The code snippet is primarily interested in the contents of buf and skips the first 16 bytes of the stream. To understand how to determine the parameter that goes into the first bytestream.read() call, that is, how many bytes of the decompressed stream to skip, we must understand what the rest of the code does. In particular: what file are we reading, and what are we trying to accomplish with the numpy library (perhaps saving RGB images in a 1D numpy array)?

I am definitely not an expert on image processing, but it seems that bytestream.read(16) is a unique solution for a unique problem of processing some unique compressed image file. Thus, it is hard to tell how to determine how many bytes to skip without seeing more code and understanding more logic behind the snippet.

  • The code `bytestream.read(16)` reads **and** skips the first 16 characters, based on documentation I read online. Most likely it's trying to skip the metadata such as headers. However, how to determine that parameter to be 16 instead of 8 or 32 is still beyond me. Let me update the question a bit more with added details – onlyphantom Jan 03 '19 at 13:46