I have some large files (around 10 GB even gzipped) that contain an ASCII header followed by, in principle, a sequence of numpy recarrays of about 3 MB each, which we call "events". My first approach looked like this:
import gzip
import numpy as np

f = gzip.GzipFile(filename)
f.read(10000)  # skip the fixed-length ASCII header

event_dtype = np.dtype([
    ('Id', '>u4'),  # simplified
    ('UnixTimeUTC', '>u4', 2),
    ('Data', '>i2', (1600, 1024)),
])

event = np.fromfile(f, dtype=event_dtype, count=1)
However, this does not work: np.fromfile needs a real file object, because it makes low-level C calls on the underlying file descriptor (I found a pretty old ticket about this: https://github.com/numpy/numpy/issues/1103).
So, as I understand it, I have to do it like this:
s = f.read(event_dtype.itemsize)
event = np.fromstring(s, dtype=event_dtype, count=1)
And yes, it works! But isn't this awfully inefficient? Isn't the memory for s allocated and garbage collected for every event? On my laptop I reach something like 16 events/s, i.e. ~50 MB/s.
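For reference, the per-event loop looks roughly like this (a sketch; process is a hypothetical stand-in for the actual analysis):

while True:
    s = f.read(event_dtype.itemsize)       # a fresh bytes object for every event
    if len(s) < event_dtype.itemsize:      # EOF or truncated event
        break
    event = np.fromstring(s, dtype=event_dtype, count=1)
    process(event[0])                      # hypothetical analysis step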
I wonder if anybody knows a smart way to allocate the memory once and then let numpy read directly into it.
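What I have in mind is something along these lines (an untested sketch, assuming a Python 3 GzipFile, which gets readinto from io.BufferedIOBase; process is again a hypothetical placeholder):

buf = bytearray(event_dtype.itemsize)          # allocated exactly once
event = np.frombuffer(buf, dtype=event_dtype)  # zero-copy, writable view onto buf

while f.readinto(buf) == event_dtype.itemsize:  # decompress straight into buf
    process(event[0])                           # event now reflects the bytes just read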
Btw, I'm a physicist, so... well, still a newbie in this business.