I've tried the method outlined by Hpaulji, but it doesn't seem to be working:

How to append many numpy files into one numpy file in python

Basically, I'm iterating through a generator, making some changes to an array, and then trying to save each iteration's array.

Here is what my sample code looks like:

filename = 'testing.npy'

current_iteration = 0
with open(filename, 'wb') as f:
    for x, _ in train_generator:
        prediction = base_model.predict(x)
        print(prediction[0,0,0,0:5])
        np.save(filename, prediction)

        current_iteration += 1
        if current_iteration == 5:
            break

Here, I'm going through 5 iterations, so I was hoping to save 5 different arrays.

I printed out a portion of each array, for debugging purposes:

[ 0.  0.  0.  0.  0.]
[ 0.          3.37349415  0.          0.          1.62561738]
[  0.          20.28489304   0.           0.           0.        ]
[ 0.  0.  0.  0.  0.]
[  0.          21.98013496   0.           0.           0.        ]

But when I tried to load the array multiple times, as described in "How to append many numpy files into one numpy file in python", I got an EOFError:

file = r'testing.npy'

with open(file,'rb') as f:
    arr = np.load(f)
    print(arr[0,0,0,0:5])
    arr = np.load(f)
    print(arr[0,0,0,0:5])

It only outputs the last array and then raises an EOFError:

[  0.          21.98013496   0.           0.           0.        ]
EOFError: Ran out of input

print(arr[0,0,0,0:5])

I was expecting all 5 arrays to be saved, but when I load the saved .npy file multiple times, I only get the last array.

So, how should I be saving and appending a new array to a file?

EDIT: Testing with '.npz' only saves the last array

filename = 'testing.npz'

current_iteration = 0
with open(filename, 'wb') as f:
    for x, _ in train_generator:
        prediction = base_model.predict(x)
        print(prediction[0,0,0,0:5])
        np.savez(f, prediction)

        current_iteration += 1
        if current_iteration == 5:
            break

# loading

file = 'testing.npz'

with open(file,'rb') as f:
    arr = np.load(f)
    print(arr.keys())

>>>['arr_0']
Moondra
  • As an aside, I don't know how large your data is, but have you tried HDF5, or are you tied to `.npy` for storage? – jpp Feb 03 '18 at 23:27
  • I haven't tried HDF5. It seems that is the better option (my data is about 100,000 images), but I would have to do a little more digging through the docs as I'm not as familiar with HDF5. – Moondra Feb 03 '18 at 23:30
  • ok, unfortunately i can't help with your question, but look up h5py documentation, the syntax is easy to pick up to start storing / appending numeric data, and if used correctly can be fast. – jpp Feb 03 '18 at 23:34
  • @jp_data_analysis Thanks, I think I may just switch to HDF5 as it's more widely used. – Moondra Feb 03 '18 at 23:44

1 Answer

All your calls to np.save use the filename, not the filehandle. Since you do not reuse the filehandle, each save overwrites the file instead of appending the array to it.

This should work:

filename = 'testing.npy'

current_iteration = 0
with open(filename, 'wb') as f:
    for x, _ in train_generator:
        prediction = base_model.predict(x)
        print(prediction[0,0,0,0:5])
        # pass the open filehandle, not the filename, so each array is appended
        np.save(f, prediction)

        current_iteration += 1
        if current_iteration == 5:
            break
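As a rough, self-contained sketch of the same idea (the helper names `save_arrays` and `load_arrays` and the dummy batches are made up for illustration and stand in for the model predictions), you can write several arrays through one filehandle and read them back until the end of the file is reached. Comparing `f.tell()` with the file size avoids relying on which exception `np.load` raises at end of file, which I believe differs between NumPy versions:

```python
import os
import tempfile

import numpy as np


def save_arrays(path, arrays):
    """Write every array to the same open filehandle, one after another."""
    with open(path, "wb") as f:
        for arr in arrays:
            np.save(f, arr)


def load_arrays(path):
    """Call np.load on one filehandle until the whole file has been read."""
    size = os.path.getsize(path)
    loaded = []
    with open(path, "rb") as f:
        while f.tell() < size:
            loaded.append(np.load(f))
    return loaded


# Dummy data standing in for five prediction batches (illustration only).
batches = [np.full((2, 3), i, dtype=np.float32) for i in range(5)]

path = os.path.join(tempfile.mkdtemp(), "testing.npy")
save_arrays(path, batches)
restored = load_arrays(path)
print(len(restored))
```

Each `np.load(f)` call reads exactly one array and leaves the file position at the start of the next one, which is why the file has to be opened once and reused.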

There may be advantages to storing multiple arrays in one .npy file (I imagine in situations where memory is limited), but .npy files are technically meant to hold a single array. To store multiple arrays, you can use .npz files instead (np.savez or np.savez_compressed):

from itertools import islice

filename = 'testing.npz'
predictions = []
# islice stops after 5 batches without pulling a 6th one from the generator
for x, _ in islice(train_generator, 5):
    prediction = base_model.predict(x)
    predictions.append(prediction)
np.savez(filename, predictions) # saves the whole list as one entry named arr_0
# np.savez(filename, predictions=predictions) # would name it predictions
# np.savez(filename, *predictions) # would name it arr_0, arr_1, …, arr_4
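To check what ends up in the file, here is a small sketch with dummy arrays in place of the predictions (the file names and array shapes are assumptions for illustration). It saves the list once as a single entry and once unpacked, then lists the keys after loading:

```python
import os
import tempfile

import numpy as np

# Dummy data standing in for five equally shaped prediction batches.
predictions = [np.full((2, 3), i, dtype=np.float32) for i in range(5)]
tmp = tempfile.mkdtemp()

# Passing the list itself stores it as one stacked array under 'arr_0'.
stacked_path = os.path.join(tmp, "stacked.npz")
np.savez(stacked_path, predictions)
with np.load(stacked_path) as data:
    stacked_keys = list(data.files)
    stacked_shape = data["arr_0"].shape

# Unpacking the list stores each array under its own key.
separate_path = os.path.join(tmp, "separate.npz")
np.savez(separate_path, *predictions)
with np.load(separate_path) as data:
    separate_keys = sorted(data.files)
    third = data["arr_2"]

print(stacked_keys, stacked_shape)
print(separate_keys)
```

Either way, `np.savez` is called once with everything that should go into the file; calling it again on the same file replaces the earlier contents rather than appending to them.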
YSelf
  • Ah! Thank you. I'm going to test it out when I get chance. – Moondra Feb 04 '18 at 01:22
  • I just tried `.npz` -- `testing.npz` and `np.savez(f, prediction)`, but it seems to be saving the last array only. I'm loading the array the same way as the code in the OP, but I only see one key -- ['arr_0']. I will update the OP just in case I'm making a mistake. – Moondra Feb 04 '18 at 01:39
  • I've added an example for npz-files. For that you only call `savez` once with all the entries (as one list of arrays or many arrays). – YSelf Feb 04 '18 at 02:56
  • @yself Thank you for listing out the multiple ways. – Moondra Feb 04 '18 at 04:10
  • @YSelf, this is a great answer as the docs don't say anything about saving lists of np arrays. I tried calling `np.savez(filename, next_array)` like an append to the file but obviously it does not work like that. – mLstudent33 Sep 04 '19 at 04:41
  • @YSelf, do I need to do `with open(filename, 'wb') as f:` for the npz example? It did not work without it. – mLstudent33 Sep 04 '19 at 05:17
  • The example works as written. The first argument to `np.savez` is either a string (the filename) or a file-like (an opened file). If you pass a file-like, it has to be opened with `wb`. – YSelf Sep 05 '19 at 12:34
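For the file-like case mentioned in the last comment, a minimal sketch using an in-memory `io.BytesIO` buffer in place of a file opened with `wb` (the array and the key name `predictions` are placeholders for illustration):

```python
import io

import numpy as np

arr = np.arange(6).reshape(2, 3)

# A binary, writable file-like object works in place of a filename.
buffer = io.BytesIO()
np.savez(buffer, predictions=arr)

# Rewind before reading the archive back from the same buffer.
buffer.seek(0)
with np.load(buffer) as data:
    keys = list(data.files)
    restored = data["predictions"]

print(keys)
```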