
I am trying to read a very large (several GB) binary file with numpy.fromfile(). Reading the entire file at once raises an out-of-memory error, so I want to create a loop that reads and processes the data N records at a time. Something like the following:

while True:
    data = numpy.fromfile(f, recordType, N)
    # process data
    if f.EOF():  # pseudocode: how do I detect end of file?
        break

How do I detect when I have reached the end of the file, so that I can break my loop?

Rachel
  • Perhaps using the h5py library is an option: https://stackoverflow.com/q/36291562/67579 – Willem Van Onsem Jul 25 '17 at 23:12
  • instead of a while loop, look up the size of the file first and loop over the number of chunks you need to read (see the sketch after this comment thread) – MrE Jul 25 '17 at 23:15
  • read the docs too; this method doesn't seem to be very portable, nor able to read arbitrary files... https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html – MrE Jul 25 '17 at 23:16
  • @MrE that would probably make sense in this case, especially since the number of records in the file might not be an exact multiple of an arbitrary N, but in general, is there no way to detect the end of file if you aren't using read()? – Rachel Jul 25 '17 at 23:20
  • I thought you might be able to detect that you didn't read as many objects as you asked for, but apparently the method doesn't return the count of objects or anything. – MrE Jul 25 '17 at 23:36
  • From the docs, fromfile will mostly read files written with tofile, or specific file formats. Did you check that you are able to read your type of file? – MrE Jul 25 '17 at 23:37
  • Anyway, if you get an OOM error when loading the whole file, you will get an OOM when reading it all with fromfile too; unless you can process your data in chunks, there is no point in doing this. – MrE Jul 25 '17 at 23:38
  • @MrE I was using fromfile for a while for this type of file and everything worked fine, but I started working with larger files so when I ran the code as is, I got the OOM. I just want to adapt the current code to start handling the data in chunks, instead of loading it all at once. – Rachel Jul 25 '17 at 23:42
  • Actually, I was wrong: the method does return the array, so you just need to check whether the array length matches the number of objects you asked for. If you got fewer, you know you read to the end of the file. Worst case, the file is an exact multiple of N objects and the following call will return 0 objects. – MrE Jul 25 '17 at 23:45
  • Is `f` a filename or an open file? – hpaulj Jul 26 '17 at 00:11
  • @hpaulj it can be either – Rachel Jul 26 '17 at 00:13
  • @hpaulj although in this case it makes sense to pass an open file, and not reopen it each time – Rachel Jul 26 '17 at 00:14
  • If it's an open file, you can load the next chunk; if it's a filename, it starts reading from the start. `fromfile` doesn't have an `offset` parameter. – hpaulj Jul 26 '17 at 00:32
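
A minimal sketch of the size-based approach suggested in the comments, assuming `f` is an already-open binary file and that the file is an exact sequence of fixed-size `recordType` records (that layout is an assumption, not something stated in the question):

import os
import numpy

# Assumption: the file is a plain sequence of fixed-size records.
fileSize = os.fstat(f.fileno()).st_size
numRecords = fileSize // numpy.dtype(recordType).itemsize
for start in range(0, numRecords, N):
    count = min(N, numRecords - start)           # the last chunk may be shorter than N
    data = numpy.fromfile(f, recordType, count)  # reads sequentially from the open file
    # process data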

1 Answer

while True:
    a = numpy.fromfile(f, recordType, N)  # returns however many records remain, up to N
    # process data
    if a.size < N:  # a short read means we hit end of file
        break
MrE
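
A self-contained version of this loop for reference; the dtype, filename, and chunk size below are made-up example values, not taken from the question:

import numpy

recordType = numpy.dtype([('x', numpy.float64), ('y', numpy.float64)])  # hypothetical record layout
N = 1000000  # records per chunk (example value)

with open('data.bin', 'rb') as f:  # keep the file open so fromfile resumes where it left off
    while True:
        a = numpy.fromfile(f, recordType, N)  # returns up to N records
        # process the (possibly partial) chunk in `a` here
        if a.size < N:  # a short read means end of file
            break

Note that if the file holds an exact multiple of N records, the final iteration reads zero records and processes an empty array, as MrE pointed out in the comments; the processing code should tolerate that.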