
I'm currently using pickle/joblib (but I'm flexible about the library) to load a large numpy array. I have an SSD with 500MB/s read speed.

I'm hoping to read my numpy array faster.

Before investing in a new SSD, I'm wondering if the new SSDs with 1000MB/s-3000MB/s read speeds would actually allow me to read the numpy arrays faster. Are the pickle/joblib libraries themselves limited in read speed?

I have confirmed my current SSD is reading at about 400-500MB/s.

Would I get 1000MB/s if I bought a 1000MB/s SSD?

Kevin
  • Almost always disks are the bottleneck. – Barmar May 20 '20 at 00:33
  • @Barmar true, but it all depends on how much work you need to do per byte. Pickling doesn't feel like it would be low overhead, but I'd expect `numpy` to do a good job. – Mark Ransom May 20 '20 at 00:37
  • You want to check your CPU usage while loading the data. If the CPU is maxed out, then getting a faster disk won't help; if your CPU is at 50%, then you might be able to load up to twice as fast. – Sam Mason May 20 '20 at 00:39
  • And if the loading speed is about the same as the disk speed, a faster disk is likely to help. – Barmar May 20 '20 at 00:41
  • @SamMason you need to be careful about what "maxed out" means. 50% might be maxed out for a dual core processor. – Mark Ransom May 20 '20 at 00:57
  • You are using the entire disk channel, so a faster SSD will very likely help. There are other ways to save arrays that should be faster than pickle: for instance, writing with `array.tofile` as a raw binary file and then reading with `numpy.memmap`, if you are dealing with basic types like ints or floats. – tdelaney May 20 '20 at 00:59
  • @SamMason - OP says _I have confirmed my current SSD is reading at about 400-500MB/s_ ... I assume that was checked while reading the array. – tdelaney May 20 '20 at 01:08
  • @tdelaney oops, yup, think that edit appeared after I read the question the first time – Sam Mason May 20 '20 at 01:11
  • For non-compressible data, simply use `np.save` and `np.load` (without any pickling). This should be as fast as your SSD can be. For compressible data, you can likely exceed the disk throughput, e.g. https://stackoverflow.com/a/56761075/4045774 – max9111 May 21 '20 at 18:47
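
The suggestions in the comments can be compared with a rough sketch like the one below. It assumes a plain float64 array, and the helper names (`time_load`, `load_pickle`, `load_memmap`) and file names are hypothetical, not from the thread. It writes the same array with pickle, `np.save`, and `array.tofile`, then times each load. Note that freshly written files usually sit in the OS page cache, so the rates it prints reflect deserialization overhead rather than SSD speed; for a disk-bound measurement you would need a file larger than RAM or a dropped cache.

```python
import os
import pickle
import tempfile
import time

import numpy as np


def time_load(load_fn, path):
    """Call load_fn(path) once and return (result, throughput in MB/s)."""
    size_mb = os.path.getsize(path) / 1e6
    start = time.perf_counter()
    result = load_fn(path)
    elapsed = time.perf_counter() - start
    return result, size_mb / elapsed


def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def load_memmap(path, dtype, shape):
    # tofile() writes raw bytes with no header, so dtype and shape
    # must be supplied by the caller. np.array() forces a full read
    # into memory so the timing is comparable to the other loaders.
    return np.array(np.memmap(path, dtype=dtype, mode="r", shape=shape))


# Hypothetical test data: 1000 x 1000 float64, about 8 MB.
arr = np.random.default_rng(0).random((1000, 1000))

tmpdir = tempfile.mkdtemp()
pkl_path = os.path.join(tmpdir, "arr.pkl")
npy_path = os.path.join(tmpdir, "arr.npy")
raw_path = os.path.join(tmpdir, "arr.bin")

with open(pkl_path, "wb") as f:
    pickle.dump(arr, f, protocol=pickle.HIGHEST_PROTOCOL)
np.save(npy_path, arr)
arr.tofile(raw_path)

from_pickle, pickle_rate = time_load(load_pickle, pkl_path)
from_npy, npy_rate = time_load(np.load, npy_path)
from_raw, raw_rate = time_load(
    lambda p: load_memmap(p, arr.dtype, arr.shape), raw_path
)

for name, rate in [("pickle", pickle_rate),
                   ("np.load", npy_rate),
                   ("memmap", raw_rate)]:
    print(f"{name:8s} {rate:10.1f} MB/s")
```

If the measured rate for a method is close to the SSD's rated read speed on a cold read, the disk is the likely bottleneck and a faster SSD should help; if it is well below, the loading code is the limit.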

0 Answers