I am training PyTorch models on various datasets. Up to this point the datasets have been images, so I can just read them on the fly with cv2 or PIL, which is fast.

Now I have a dataset of tensor objects of shape [400, 400, 8]. In the past I have tried loading such objects with PyTorch's and NumPy's built-in tensor reading operations, but these are generally much slower than reading images.

The objects are currently stored in compressed HDF5 files (written with h5py), with ~800 objects per file. My plan was to save the objects individually in some format and then read them on the fly, but I am unsure which format would be fastest to read.

I would like to avoid keeping them all in memory as I believe the memory requirement would be too high.

MWB
mkohler
  • You can save NumPy arrays in a compressed format. The speed generally depends on the data. E.g., this solution also outperforms image compression libraries: https://stackoverflow.com/a/56761075/4045774 – max9111 Sep 27 '21 at 12:44
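
As a rough illustration of the comment above, here is a minimal sketch that writes one sample per file using NumPy's built-in `np.savez_compressed` and reads it back. This is not necessarily the method used in the linked answer; the file name `sample_0000.npz`, the key `data`, and the synthetic array are assumptions made for the example. How much the file shrinks, and how fast it loads, depends heavily on the data.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Synthetic stand-in for one [400, 400, 8] sample (mostly zeros, so it compresses well).
    x = np.zeros((400, 400, 8), dtype=np.float32)
    x[100:200, 100:200, :] = 1.0

    # One compressed file per sample; "data" is an arbitrary key chosen for this sketch.
    path = Path(tmp) / "sample_0000.npz"
    np.savez_compressed(path, data=x)

    # Reading happens on demand, so only the requested sample is held in memory.
    with np.load(path) as archive:
        loaded = archive["data"]

    same = np.array_equal(x, loaded)
    compressed_bytes = path.stat().st_size
    raw_bytes = x.nbytes
```

Real measurements are noisier than this synthetic case, so it is worth timing the load step on a few actual samples before committing to a format.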

1 Answer

If the data arrays are still "images", just 8-channel ones, you can split them into 3 image files

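# x is assumed to be a NumPy array of shape (H, W, 8)
a = x[:, :, 0:3]
b = x[:, :, 3:6]
c = x[:, :, 5:8].copy()  # copy, so the next line does not also zero channel 5 of x (and b)
c[:, :, 0] = 0  # channel 5 is already stored in b; zeroing the duplicate reduces the compressed size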

and store them using the conventional image libraries (cv2 and PIL).

Images typically compress much better than general data (lossy JPEG even more so), and therefore this reduces both disk space and bandwidth, and it also benefits from file system caching.
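
A minimal round-trip sketch of this idea, assuming the data is `uint8` and using lossless PNG via Pillow. It is a variant of the split described above: instead of overlapping channel 5, it pads the 8 channels with one zero channel so they fill three RGB images. The helper names `save_as_pngs` and `load_from_pngs` and the `_0/_1/_2` file suffixes are hypothetical. Lossy JPEG would be smaller still, but would not reproduce the values exactly.

```python
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image


def save_as_pngs(x: np.ndarray, stem: Path) -> None:
    """Write an (H, W, 8) uint8 array as three lossless RGB PNG files."""
    assert x.dtype == np.uint8 and x.ndim == 3 and x.shape[2] == 8
    # Pad with one all-zero channel so the 8 channels fill three 3-channel images.
    padded = np.concatenate([x, np.zeros_like(x[:, :, :1])], axis=2)
    for i in range(3):
        part = np.ascontiguousarray(padded[:, :, 3 * i:3 * i + 3])
        Image.fromarray(part).save(f"{stem}_{i}.png")


def load_from_pngs(stem: Path) -> np.ndarray:
    """Read the three PNG files back and drop the padding channel."""
    parts = [np.asarray(Image.open(f"{stem}_{i}.png")) for i in range(3)]
    return np.concatenate(parts, axis=2)[:, :, :8]


with tempfile.TemporaryDirectory() as tmp:
    rng = np.random.default_rng(0)
    x = rng.integers(0, 256, size=(400, 400, 8), dtype=np.uint8)
    stem = Path(tmp) / "sample"
    save_as_pngs(x, stem)
    restored = load_from_pngs(stem)
    roundtrip_ok = np.array_equal(x, restored)
```

If the values are not 8-bit integers, they would first need to be quantized or stored in a format that supports the original dtype, which this sketch does not cover.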

MWB
  • I'm accepting this as I like the solution for my problem where the data is in this format. I suppose it would be possible to reframe any tensor object as (N,H,W,3) shape and just reshape upon loading it. – mkohler Sep 23 '21 at 17:26
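
The reshaping idea from the comment above could look roughly like the following sketch for an (H, W, C) array: pad the channel axis to a multiple of 3, view it as N three-channel images, and undo the steps on loading. The function names `to_rgb_stack` and `from_rgb_stack` are hypothetical, and the original channel count has to be stored or known separately.

```python
import numpy as np


def to_rgb_stack(x: np.ndarray) -> np.ndarray:
    """Turn an (H, W, C) array into an (N, H, W, 3) stack, zero-padding C up to 3 * N."""
    h, w, c = x.shape
    n = -(-c // 3)  # ceiling division: number of 3-channel images needed
    padded = np.zeros((h, w, 3 * n), dtype=x.dtype)
    padded[:, :, :c] = x
    # (H, W, 3N) -> (H, W, N, 3) -> (N, H, W, 3)
    return padded.reshape(h, w, n, 3).transpose(2, 0, 1, 3)


def from_rgb_stack(stack: np.ndarray, channels: int) -> np.ndarray:
    """Invert to_rgb_stack and drop the padding channels."""
    n, h, w, _ = stack.shape
    # (N, H, W, 3) -> (H, W, N, 3) -> (H, W, 3N), then trim to the original channel count
    return stack.transpose(1, 2, 0, 3).reshape(h, w, 3 * n)[:, :, :channels]


x = np.arange(400 * 400 * 8, dtype=np.int64).reshape(400, 400, 8)
stack = to_rgb_stack(x)
restored = from_rgb_stack(stack, channels=8)
```

Each `stack[i]` is then an ordinary (H, W, 3) array that can be handed to an image writer, provided its dtype is one the image format supports.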