
I have some large files (around 10 GB even gzipped) that contain an ASCII header followed by, in principle, numpy recarrays of about 3 MB each, which we call "events". My first approach looked like this:

import gzip
import numpy as np

f = gzip.GzipFile(filename)
f.read(10000)  # skip the fixed-length ASCII header
event_dtype = np.dtype([
        ('Id', '>u4'),                # simplified
        ('UnixTimeUTC', '>u4', 2),
        ('Data', '>i2', (1600, 1024)),
        ])
event = np.fromfile(f, dtype=event_dtype, count=1)

However, this is not possible, since np.fromfile needs a real FILE object because it makes low-level calls (I found a pretty old ticket: https://github.com/numpy/numpy/issues/1103).

So, as I understand it, I have to do it like this:

s = f.read(event_dtype.itemsize)
event = np.fromstring(s, dtype=event_dtype, count=1)

And yes, it works! But isn't this awfully inefficient? Isn't the memory for s allocated and garbage collected for every event? On my laptop I reach something like 16 events/s, i.e. ~50 MB/s.

I wonder if anybody knows a smart way to allocate the memory once and then let numpy read directly into that memory.
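What I have in mind is roughly the sketch below, reusing event_dtype from above. I am not sure whether GzipFile's readinto() accepts a NumPy array as the buffer, and process() is only a placeholder for the actual analysis:

# Allocate one event up front and let readinto() fill its memory in place.
event = np.empty(1, dtype=event_dtype)

f = gzip.GzipFile(filename)
f.read(10000)                              # skip the fixed-length ASCII header
while f.readinto(event) == event.nbytes:   # assumes readinto() accepts the array as buffer
    process(event[0])                      # placeholder for the real per-event work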

Btw, I'm a physicist, so ... well, still a newbie in this business.

Dominik Neise
  • The time taken by the I/O is like *thousands* of times bigger than the time taken to allocate/deallocate that string. You should profile the code to see where the bottleneck is and then optimize it (see the profiling sketch after these comments)... guessing where the bottleneck is is bad, even more so if you're not used to programming efficiently. – Bakuriu Apr 12 '13 at 11:19
  • As long as you're okay with read-only arrays, you could use `numpy.frombuffer` to avoid duplicating the memory and just use the string as a memory buffer. – Joe Kington Apr 12 '13 at 14:21
  • @Bakuriu thanks for clearly phrasing that. I have no experience with profiling code. And it's good to hear that guessing is bad. – Dominik Neise Apr 28 '13 at 17:09
  • @Joe Kington. Thanks for the clear example! I will go for it. – Dominik Neise Apr 28 '13 at 17:11
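Following up on the profiling suggestion above: a minimal sketch using the standard-library cProfile and pstats modules, where read_events(filename) is a hypothetical function wrapping the event-reading loop from the question:

import cProfile
import pstats

# Profile a hypothetical read_events(filename), save the statistics to a file,
# and print the 10 functions with the largest cumulative time.
cProfile.run('read_events(filename)', 'read_events.prof')
pstats.Stats('read_events.prof').sort_stats('cumulative').print_stats(10)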

1 Answer


@Bakuriu is probably correct that this is a micro-optimization. Your bottleneck is almost definitely IO, and after that, decompression. Allocating the memory twice probably isn't significant.

However, if you wanted to avoid the extra memory allocation, you could use numpy.frombuffer to view the string as a numpy array.

This avoids duplicating memory (the string and the array use the same memory buffer), but the array will be read-only by default. You can then change it to allow writing if you need to.

In your case, it would be as simple as replacing fromstring with frombuffer:

import gzip
import numpy as np

f = gzip.GzipFile(filename)
f.read(10000)  # fixed-length ASCII header
event_dtype = np.dtype([
        ('Id', '>u4'),                # simplified
        ('UnixTimeUTC', '>u4', 2),
        ('Data', '>i2', (1600, 1024)),
        ])
s = f.read(event_dtype.itemsize)                      # raw bytes for one event
event = np.frombuffer(s, dtype=event_dtype, count=1)  # zero-copy view of those bytes

Just to prove that memory is not duplicated using this approach:

import numpy as np

x = "hello"
y = np.frombuffer(x, dtype=np.uint8)

# Make "y" writeable...
y.flags.writeable = True

# Prove that we're using the same memory
y[0] = 121   # 121 == ord('y')
print x      # <-- notice that we changed y but are printing x

This yields: yello instead of hello.
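Note that the snippet above is Python 2. On Python 3, frombuffer on an immutable bytes object gives a read-only array, so a mutable bytearray is needed for the same zero-copy demonstration; a minimal sketch:

import numpy as np

buf = bytearray(b"hello")                # mutable buffer, so the view below is writable
y = np.frombuffer(buf, dtype=np.uint8)   # same memory as buf, no copy made

y[0] = ord("y")                          # write through the array...
print(buf.decode("ascii"))               # ...and the change shows up in buf: "yello"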

Regardless of whether or not it's a significant optimization in this particular case, it's a useful approach to be aware of.

Joe Kington
  • Big plus one for frombuffer! I was trying to use fromfile on a zipped file for a while and that was the key. – Fractaly Apr 23 '19 at 03:30