I would like to suggest a concept based on Redis, which is not so much a database as a "data structure server", wherein the data structures are your 600-byte images. I am not suggesting for a minute that you rely on Redis as permanent storage; rather, continue to use your 650,000 files, but cache them in Redis, which is free and available for Linux, macOS and Windows. So, basically, at any point in the day, you could copy your images into Redis, ready for the next restart.
I don't speak Python, but here is a Perl script that I used to generate 650,000 images of 600 random bytes each and insert them into a Redis hash. The corresponding Python would be pretty easy to write:
#!/usr/bin/perl
################################################################################
# generator <number of images> <image size in bytes>
# Mark Setchell
# Generates and sends "images" of specified size to REDIS
################################################################################
use strict;
use warnings FATAL => 'all';
use Redis;
use Time::HiRes qw(time);
my $Debug = 1;    # set to 1 for debug messages

my $nargs = $#ARGV + 1;
if ($nargs != 2) {
    print "Usage: generator <number of images> <image size in bytes>\n";
    exit 1;
}
my $nimages = $ARGV[0];
my $imsize  = $ARGV[1];

my @bytes = (q(a)..q(z), q(A)..q(Z), q(0)..q(9));
my $bl = scalar @bytes;    # rand($bl) yields indices 0..$bl-1

print "DEBUG: images: $nimages, size: $imsize\n" if $Debug;

# Connection to Redis (localhost:6379 by default)
my $redis = Redis->new;

my $start = time;
for (my $i = 0; $i < $nimages; $i++) {
    # Generate our "image" of $imsize random alphanumeric bytes
    my $image = '';
    for (my $j = 0; $j < $imsize; $j++) {
        $image .= $bytes[rand $bl];
    }
    # Load it into a Redis hash called 'im', indexed by an integer
    $redis->hset('im', $i, $image);
    print "DEBUG: Sending key:im, field:$i, value:$image\n" if $Debug;
}
my $elapsed = time - $start;
printf "DEBUG: Sent $nimages images of $imsize bytes in %.3f seconds, %d images/s\n", $elapsed, int($nimages/$elapsed);
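As a rough, untested sketch of what the equivalent might look like in Python (assuming the third-party redis-py package and a Redis server on localhost:6379; `make_image` and `store_images` are just illustrative names):

```python
# Rough Python equivalent of the Perl generator above - an untested sketch,
# assuming the third-party "redis" package (redis-py) and a Redis server
# listening on localhost:6379.
import random
import string
import time

ALPHABET = string.ascii_letters + string.digits

def make_image(size):
    """Generate a pseudo-image of `size` random alphanumeric bytes."""
    return ''.join(random.choice(ALPHABET) for _ in range(size)).encode()

def store_images(nimages, imsize):
    """Generate nimages images and load them into a Redis hash called 'im'."""
    import redis                      # requires the redis-py package
    r = redis.Redis()                 # localhost:6379 by default
    start = time.time()
    for i in range(nimages):
        # Field is the integer index, value is the random "image"
        r.hset('im', i, make_image(imsize))
    elapsed = time.time() - start
    print(f"Sent {nimages} images of {imsize} bytes in {elapsed:.3f} seconds")

# e.g. store_images(650000, 600)
```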
So, you can insert the 650,000 images of 600 bytes each into a Redis hash called "im", indexed by a simple number [0..649999].
Now, if you stop Redis and check the size of the database, it is 376MB:
ls -lhrt dump.rdb
-rw-r--r-- 1 mark admin 376M 29 May 20:00 dump.rdb
If you now kill Redis and restart it, it takes 2.862 seconds to start and load the 650,000-image database:
redis-server /usr/local/etc/redis.conf
33802:M 29 May 20:00:57.698 # Server started, Redis version 3.2.9
33802:M 29 May 20:01:00.560 * DB loaded from disk: 2.862 seconds
33802:M 29 May 20:01:00.560 * The server is now ready to accept connections on port 6379
So, you could start Redis in under 3 seconds after a reboot. Then you can query and load the 650,000 images like this:
#!/usr/bin/perl
################################################################################
# reader
# Mark Setchell
# Reads specified number of images from Redis
################################################################################
use strict;
use warnings FATAL => 'all';
use Redis;
use Time::HiRes qw(time);
my $Debug = 0;    # set to 1 for debug messages

my $nargs = $#ARGV + 1;
if ($nargs != 1) {
    print "Usage: reader <number of images>\n";
    exit 1;
}
my $nimages = $ARGV[0];

# Connection to Redis (localhost:6379 by default)
my $redis = Redis->new;

my $start = time;
for (my $i = 0; $i < $nimages; $i++) {
    # Retrieve image from the hash named "im" with field $i
    my $image = $redis->hget('im', $i);
    print "DEBUG: Received image $i\n" if $Debug;
}
my $elapsed = time - $start;
printf "DEBUG: Received $nimages images in %.3f seconds, %d images/s\n", $elapsed, int($nimages/$elapsed);
And that reads 650,000 images of 600 bytes each in 61 seconds on my Mac, so your total startup time would be around 64 seconds. I don't know enough Python yet to rewrite it in Python, but I suspect the times would be pretty similar.
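An untested sketch of what the reader might look like in Python, under the same assumptions as before (redis-py installed, Redis server on localhost:6379; `read_images` is an illustrative name):

```python
# Rough, untested sketch of the reader in Python, again assuming the
# third-party "redis" package (redis-py) and a Redis server on localhost:6379.
import time

def read_images(conn, nimages):
    """Fetch fields 0..nimages-1 from the Redis hash 'im' as a list of bytes."""
    return [conn.hget('im', i) for i in range(nimages)]

def main(nimages):
    import redis                      # requires the redis-py package
    conn = redis.Redis()              # localhost:6379 by default
    start = time.time()
    images = read_images(conn, nimages)
    elapsed = time.time() - start
    print(f"Received {len(images)} images in {elapsed:.3f} seconds")

# e.g. main(650000)
```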
I am basically using a Redis hash called "im", with hset and hget, and am indexing the images by a simple integer. However, Redis keys are binary-safe, so you could use filenames as keys instead of integers. You can also interact with Redis at the command line (without Python or Perl), so you can get a list of the 650,000 keys (filenames) at the command line with:
redis-cli <<< "hkeys im"
or retrieve a single image (with key/filename="1") with:
redis-cli <<< "hget 'im' 1"
If you don't have bash, you could do:
echo "hget 'im' 1" | redis-cli
or
echo "hkeys im" | redis-cli
I was just reading about persisting/serializing NumPy arrays, so that may be an even simpler option than involving Redis... see here.
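For instance, a minimal sketch of that idea, assuming NumPy (the 1,000-row array and filename are just illustrative; 650,000 rows of 600 bytes would be roughly 372 MiB, much like the Redis dump above):

```python
import os
import tempfile
import numpy as np

# Pack all images into a single uint8 array, one row per 600-byte image.
# 1000 rows here just to illustrate; 650,000 rows would be ~372 MiB.
images = np.random.randint(0, 256, size=(1000, 600), dtype=np.uint8)

# One np.save/np.load round trip replaces 650,000 small-file reads.
path = os.path.join(tempfile.mkdtemp(), 'images.npy')
np.save(path, images)
restored = np.load(path)

image_42 = restored[42].tobytes()    # any single image back as 600 bytes
```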