
I have one big file saved with numpy in append mode, i.e., it contains maybe 5000 arrays, each with a shape like [1, 224, 224, 3], written like this:

import numpy as np

filepath = 'hello'
for ndarray in some_iterable:  # pseudocode: whatever produces the arrays
    ...
    with open(filepath, 'ab') as f:
        np.save(f, ndarray)

I need to load the data from the file, maybe all arrays at once, or maybe in a generator-like fashion: reading the first 100, then the next 100, and so on. Is there a proper way to do this? So far I only know that np.load gets me one array per call, and I don't know how to read, say, arrays 100 to 199.

The question loading arrays saved using numpy.save in append mode talks about something related, but it doesn't seem to be what I want.

U13-Forward
JQK
  • If you recorded the file sizes during creation, you might be able to use `seek` to jump ahead to the right spot. But as the link shows, this is an undocumented feature, and you're basically on your own. – hpaulj Jun 29 '18 at 03:45
  • Thanks @hpaulj, but unfortunately I did not record the sizes, as the data is stored dynamically and recording the sizes would be a little complicated... Is there a better way to solve this? – JQK Jun 29 '18 at 04:01
  • Are you interested in loading individual arrays? Or concatenating arrays along an axis? – alta Jun 29 '18 at 04:16
  • I want to concatenate them into one array, but I don't know how many arrays each file contains. @HanAltae-Tran – JQK Jun 29 '18 at 04:22
  • Do you control the save loop? Looks like a good place to fix. Especially switching the order of the loop and the with. – Mad Physicist Jun 29 '18 at 04:29
  • Also, are all the arrays of the same dtype as well as size? If so, load one, do an ftell (or whatever Python calls it), and voila, you have the size of each array in the file (see the sketch after these comments). – Mad Physicist Jun 29 '18 at 04:32
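
A minimal sketch of the seek/ftell idea from the comments above, assuming every array in the file has the same shape and dtype (so every np.save record occupies the same number of bytes), and relying on the undocumented behavior of calling np.load repeatedly on one file handle:

import numpy as np

with open(filepath, 'rb') as f:
    first = np.load(f)        # read the first array
    record_size = f.tell()    # bytes per record, .npy header included

    f.seek(100 * record_size)                 # jump straight to record 100
    chunk = [np.load(f) for _ in range(100)]  # arrays 100-199

batch = np.stack(chunk)  # shape (100, 1, 224, 224, 3)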

2 Answers


One solution, although ugly, is the following; note that it can only load all arrays in the file at once (and thus risks an out-of-memory error):

import numpy as np

a = []
with open(filepath, 'rb') as f:
    while True:
        try:
            a.append(np.load(f))
        except (EOFError, ValueError, OSError):  # raised once no data is left
            break
result = np.stack(a)  # shape (n, 1, 224, 224, 3)
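
The same loop can be turned into a generator that yields fixed-size batches instead of holding everything in memory; a sketch (the helper name and chunk size are illustrative):

import numpy as np

def load_chunks(filepath, chunk_size=100):
    # Yield consecutive stacks of up to chunk_size arrays from the file.
    with open(filepath, 'rb') as f:
        chunk = []
        while True:
            try:
                chunk.append(np.load(f))
            except (EOFError, ValueError, OSError):  # no data left
                break
            if len(chunk) == chunk_size:
                yield np.stack(chunk)
                chunk = []
        if chunk:  # final partial batch
            yield np.stack(chunk)

With this, arrays 100-199 arrive as the second batch of load_chunks('hello').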
JQK

This is more of a hack (given your situation).

Anyway, here is the code that creates the file with np.save in append mode:

import numpy as np

numpy_arrays = [np.array([1, 2, 3]), np.array([0, 9])]

print(numpy_arrays[0], numpy_arrays[1])
print(type(numpy_arrays[0]), type(numpy_arrays[1]))
for numpy_array in numpy_arrays:
    with open("./my-numpy-arrays.bin", 'ab') as f:
        np.save(f, numpy_array)

[1 2 3] [0 9]
<class 'numpy.ndarray'> <class 'numpy.ndarray'>

... and here is the code that catches the end-of-file error (and other exceptions) while looping through:

with open ("./my-numpy-arrays.bin", 'rb') as f:
    while True:
        try :   
            numpy_array = np.load(f)
            print numpy_array
        except : 
            break

[1 2 3]
[0 9]

Not very pretty but ... it works.

Edward Aung
  • Thanks @Edward. This is one solution, but it has the risk of an out-of-memory error. Think of an extremely large numpy array file... (like 100 GB? Although I am not using such a large file...) – JQK Jun 29 '18 at 04:40
  • Since you did not save the offset of each array in the file, the only thing we can do is scan it this way. Practically speaking, if I were to encounter a similar situation, I would either divide the file into multiple files (if space permits) or build an index of the starting position of each member array and seek for subsequent accesses (see the sketch after these comments). Either way, you can't get away from reading it once. My loop does not grow as it reuses the variable. – Edward Aung Jun 29 '18 at 04:50
  • Thanks @Edward. You are right, it will not grow, but if, say, I want to load 100 arrays at a time, I would need to re-read the first 100 arrays every time I load the next 100, which is very inefficient. Maybe you are right that it is impossible to do what I want without saving things properly in advance. Thanks again. – JQK Jun 29 '18 at 19:08
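
A sketch of the index Edward Aung describes, assuming you control the save loop; the index filename and the arrays_to_save iterable are illustrative:

import numpy as np

# While saving, remember the byte offset where each record starts.
offsets = []
with open(filepath, 'ab') as f:
    for ndarray in arrays_to_save:  # hypothetical source of arrays
        offsets.append(f.tell())    # start position of this record
        np.save(f, ndarray)
np.save(filepath + '_index.npy', np.array(offsets))

# Later: read arrays 100-199 without scanning the first 100.
offsets = np.load(filepath + '_index.npy')
with open(filepath, 'rb') as f:
    f.seek(offsets[100])
    arrays = [np.load(f) for _ in range(100)]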