
I am trying to read a very large (several GB) binary file with numpy.fromfile(). Reading the entire file at once raises an out-of-memory error, so I want to create a loop that reads and processes the data N records at a time. Something like the following:

while True:
    data = numpy.fromfile(f, recordType, N)
    # process data
    if f.EOF():  # pseudocode: how do I detect end of file?
        break

How do I detect when I have reached the end of the file, so that I can break my loop?

Rachel
  • Perhaps using the h5py library is an option: https://stackoverflow.com/q/36291562/67579 – Willem Van Onsem Jul 25 '17 at 23:12
  • instead of a while loop, look up the size of the file first and loop over the number of chunks you need to read (see the sketch after this comment thread) – MrE Jul 25 '17 at 23:15
  • read the docs too; this method doesn't seem to be very portable, nor able to read arbitrary files... https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html – MrE Jul 25 '17 at 23:16
  • @MrE that would probably make sense in this case, especially since the number of records in the file might not be an exact multiple of an arbitrary N, but in general, is there no way to detect the end of file if you aren't using read()? – Rachel Jul 25 '17 at 23:20
  • I thought you might be able to detect that you didn't read as many objects as you asked for, but apparently the method doesn't return the count of objects or anything. – MrE Jul 25 '17 at 23:36
  • From the docs, fromfile will mostly read files written with tofile, or specific file formats. Did you check that you are able to read your type of file? – MrE Jul 25 '17 at 23:37
  • Anyway, if you get an OOM error when loading the whole file, you will get an OOM when reading it all with fromfile too; unless you can process your data in chunks, there is no point in doing this. – MrE Jul 25 '17 at 23:38
  • @MrE I was using fromfile for a while for this type of file and everything worked fine, but I started working with larger files so when I ran the code as is, I got the OOM. I just want to adapt the current code to start handling the data in chunks, instead of loading it all at once. – Rachel Jul 25 '17 at 23:42
  • Actually, I was wrong: the method does return the array, so you just need to check whether the array length matches the number of objects you asked for. If you got fewer, you know you read to the end of the file. Worst case, the file is an exact multiple of N objects and the following call will return 0 objects. – MrE Jul 25 '17 at 23:45
  • Is `f` a filename or an open file? – hpaulj Jul 26 '17 at 00:11
  • @hpaulj it can be either – Rachel Jul 26 '17 at 00:13
  • @hpaulj although in this case it makes sense to pass an open file, and not reopen it each time – Rachel Jul 26 '17 at 00:14
  • If it's an open file, you can load the next chunk; if it's a filename, it starts reading from the start. `fromfile` doesn't have an `offset` parameter. – hpaulj Jul 26 '17 at 00:32
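
A minimal sketch of the size-based approach suggested in the comments, assuming `f` is an already-open binary file and that the file is an exact sequence of fixed-size `recordType` records (that layout is an assumption, not something stated in the question):

import os
import numpy

# Assumption: the file is a plain sequence of fixed-size records.
fileSize = os.fstat(f.fileno()).st_size
numRecords = fileSize // numpy.dtype(recordType).itemsize
for start in range(0, numRecords, N):
    count = min(N, numRecords - start)           # the last chunk may be shorter than N
    data = numpy.fromfile(f, recordType, count)  # reads sequentially from the open file
    # process data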

1 Answer

while True:
    a = numpy.fromfile(f, recordType, N)  # returns however many records remain, up to N
    # process data
    if a.size < N:  # a short read means we hit end of file
        break
MrE
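
A self-contained version of this loop for reference; the dtype, filename, and chunk size below are made-up example values, not taken from the question:

import numpy

recordType = numpy.dtype([('x', numpy.float64), ('y', numpy.float64)])  # hypothetical record layout
N = 1000000  # records per chunk (example value)

with open('data.bin', 'rb') as f:  # keep the file open so fromfile resumes where it left off
    while True:
        a = numpy.fromfile(f, recordType, N)  # returns up to N records
        # process the (possibly partial) chunk in `a` here
        if a.size < N:  # a short read means end of file
            break

Note that if the file holds an exact multiple of N records, the final iteration reads zero records and processes an empty array, as MrE pointed out in the comments; the processing code should tolerate that.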