
I have ~650,000 image files that I convert to numpy arrays with cv2. The images are arranged into subfolders with ~10k images in each. Each image is tiny: about 600 bytes (2×100 pixels, RGB).

When I read them all using:

cv2.imread()

it takes half a second per 10k images, under a minute for all 650k... except after I restart my machine. Then it takes 20-50 seconds per 10k images the first time I run my script after a reboot; half an hour or so for the full read.

Why?

How can I keep them rapidly accessible after restart without the arduously slow initial read?

The database of historic images grows daily; older ones do not get rewritten.

code:

import os
import time

import cv2
import numpy as np

# ASSET (the subfolder still being written to) and a1 are defined elsewhere in the full script

print 'Building historic database...'
elapsed = elapsed2 = time.time()

def get_immediate_subdirectories(a_dir):
    return [name for name in os.listdir(a_dir)
            if os.path.isdir(os.path.join(a_dir, name))]

compare = get_immediate_subdirectories('images_old')
compare.sort()

images = []
for j in compare:
    begin = 1417024800
    end = 1500000000
    if ASSET == j:
        end = int(time.time() - 86400 * 30)
    tally = 0
    for i in range(begin, end, 7200):
        # cv2.imread() returns None (it does not raise) when the file is missing
        im = cv2.imread("images_old/%s/%s_%s.png" % (j, j, i))
        if im is not None:
            images.append([j, i, np.ndarray.flatten(im)])
            tally += 1
    print j.ljust(5), ('cv2 imread elapsed: %.2f items: %s' % ((time.time() - elapsed), tally))
    elapsed = time.time()
print '%.2f cv2 imread big data: %s X %s items' % ((time.time() - elapsed2), len(images), len(a1))
elapsed = time.time()

AMD FM2+, 16 GB RAM, Linux Mint 17.3, Python 2.7

litepresence
  • That's less than 600MB. How about an SSD? A few milliseconds of rotational and seek delay multiplied by 650,000 soon adds up. – Mark Setchell May 29 '17 at 13:48
  • Consider initializing and assigning. Related post - https://stackoverflow.com/questions/44078327/fastest-approach-to-read-thousands-of-images-into-one-big-numpy-array – Divakar May 29 '17 at 14:06
  • As a quick test, you could make a 1GB RAMdrive, copy your images to that, empty the buffer cache and start your application loading from RAMdrive. That will be an indicator of the very, very best you could possibly hope to nearly approach from SSD/NVME. – Mark Setchell May 29 '17 at 14:16
  • Have you considered a different representation of the data? After the day has passed, it sounds like you'll only ever read the stuff, so you might as well make it efficient to read. Say, clump every directory into a single image accompanied by an index/metadata file. The situation you describe sounds rather obscene :D And inefficient, be it space or decoding overhead. – Dan Mašek May 29 '17 at 17:22
  • Regarding the reboot: it's because your operating system caches everything that it can. To flush all caches on Linux: `sync; echo 3 > /proc/sys/vm/drop_caches` – RedEyed Dec 15 '19 at 19:17

4 Answers


I would like to suggest a concept based on REDIS, which is like a database but is actually a "data structure server" wherein the data structures are your 600 byte images. I am not suggesting for a minute that you rely on REDIS as a permanent storage system; rather, continue to use your 650,000 files but cache them in REDIS, which is free and available for Linux, macOS and Windows.

So, basically, at any point in the day, you could copy your images into REDIS ready for the next restart.

I don't speak Python, but here is a Perl script that I used to generate 650,000 images of 600 random bytes each and insert them into a REDIS hash. The corresponding Python would be pretty easy to write:

#!/usr/bin/perl
################################################################################
# generator <number of images> <image size in bytes>
# Mark Setchell
# Generates and sends "images" of specified size to REDIS
################################################################################
use strict;
use warnings FATAL => 'all';
use Redis;
use Time::HiRes qw(time);

my $Debug=1;    # set to 1 for debug messages

my $nargs = $#ARGV + 1;
if ($nargs != 2) {
    print "Usage: generator <number of images> <image size in bytes>\n";
    exit 1;
}

my $nimages=$ARGV[0];
my $imsize=$ARGV[1];
my @bytes=(q(a)..q(z),q(A)..q(Z),q(0)..q(9));
my $bl = scalar @bytes - 1;

printf "DEBUG: images: $nimages, size: $imsize\n" if $Debug;

# Connection to REDIS
my $redis = Redis->new;
my $start=time;

for(my $i=0;$i<$nimages;$i++){
   # Generate our 600 byte "image"
   my $image;
   for(my $j=0;$j<$imsize;$j++){
      $image .= $bytes[rand $bl];
   }
   # Load it into a REDIS hash called 'im' indexed by an integer number
   $redis->hset('im',$i,$image);
   print "DEBUG: Sending key:images, field:$i, value:$image\n" if $Debug;
}
my $elapsed=time-$start;
printf "DEBUG: Sent $nimages images of $imsize bytes in %.3f seconds, %d images/s\n",$elapsed,int($nimages/$elapsed)

So, you can insert the 650,000 images of 600 bytes each into a REDIS hash called "im" indexed by a simple number [1..650000].
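
Untested, but the writing side should also be straightforward in Python using the redis-py package (an assumption on my part), walking the question's images_old tree; this variant uses the filenames as the hash fields rather than integers, which REDIS also allows:

import os
import time

import redis

r = redis.Redis()                      # assumes a local REDIS server on the default port 6379
start = time.time()

count = 0
for root, dirs, files in os.walk('images_old'):
    for name in files:
        if name.endswith('.png'):
            with open(os.path.join(root, name), 'rb') as f:
                # the filename becomes the (binary-safe) field in the 'im' hash
                r.hset('im', name, f.read())
            count += 1

print('Cached %d images in %.3f seconds' % (count, time.time() - start))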

Now, if you stop REDIS and check the size of the database, it is 376MB:

ls -lhrt dump.rdb

-rw-r--r--  1 mark  admin   376M 29 May 20:00 dump.rdb

If you now kill REDIS, and restart it, it takes 2.862 seconds to start and load the 650,000 image database:

redis-server /usr/local/etc/redis.conf

                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 3.2.9 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 33802
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

33802:M 29 May 20:00:57.698 # Server started, Redis version 3.2.9
33802:M 29 May 20:01:00.560 * DB loaded from disk: 2.862 seconds
33802:M 29 May 20:01:00.560 * The server is now ready to accept connections on port 6379

So, you could start REDIS in under 3 seconds after reboot. Then you can query and load the 650,000 images like this:

#!/usr/bin/perl
################################################################################
# reader
# Mark Setchell
# Reads specified number of images from Redis
################################################################################
use strict;
use warnings FATAL => 'all';
use Redis;
use Time::HiRes qw(time);

my $Debug=0;    # set to 1 for debug messages
my $nargs = $#ARGV + 1;
if ($nargs != 1) {
    print "Usage: reader <number of images>\n";
    exit 1;
}

my $nimages=$ARGV[0];

# Connection to REDIS
my $redis = Redis->new;
my $start=time;

for(my $i=0;$i<$nimages;$i++){
   # Retrieve image from hash named "im" with key=$i
   my $image = $redis->hget('im',$i);
   print "DEBUG: Received image $i\n" if $Debug;
}
my $elapsed=time-$start;
printf "DEBUG: Received $nimages images in %.3f seconds, %d images/s\n",$elapsed,int($nimages/$elapsed)

And that reads 650,000 images of 600 bytes each in 61 seconds on my Mac, so your total startup time would be 64 seconds.

Sorry, I don't know enough Python yet to do it in Python but I suspect the times would be pretty similar.
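
For anyone who wants to try, here is a rough, untested Python sketch of the same reader, again assuming the redis-py package and decoding each PNG with OpenCV as in the question:

import time

import cv2
import numpy as np
import redis

r = redis.Redis()
start = time.time()

images = []
for key in r.hkeys('im'):                              # one field per cached image
    data = r.hget('im', key)
    im = cv2.imdecode(np.frombuffer(data, np.uint8), cv2.IMREAD_COLOR)
    if im is not None:
        images.append((key, im.flatten()))

print('Read %d images in %.3f seconds' % (len(images), time.time() - start))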

I am basically using a REDIS hash called "im", with hset and hget and am indexing the images by a simple integer. However, REDIS keys are binary safe, so you could use filenames as keys instead of integers. You can also interact with REDIS at the command-line (without Python or Perl), so you can get a list of the 650,000 keys (filenames) at the command line with:

redis-cli <<< "hkeys im"

or retrieve a single image (with key/filename="1") with:

 redis-cli <<< "hget 'im' 1"

If you don't have bash, you could do:

echo "hget 'im' 1" | redis-cli

or

echo "hkeys im" | redis-cli

I was just reading about persisting/serializing Numpy arrays, so that may be an even simpler option than involving REDIS... see here.
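
As a sketch of that idea (assuming the images list from the question fits in memory and every flattened array has the same length), the whole database could be dumped once and re-loaded in a couple of seconds:

import numpy as np

# one-off save after the slow directory scan; `images` is the question's
# list of [subfolder, timestamp, flattened_array] rows
data = np.array([row[2] for row in images], dtype=np.uint8)   # N x 600 array
keys = np.array([(row[0], row[1]) for row in images])
np.savez('historic.npz', data=data, keys=keys)

# fast reload after a reboot
archive = np.load('historic.npz')
data, keys = archive['data'], archive['keys']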

Mark Setchell

I was thinking overnight and have an even simpler, faster solution...

Basically, at any point you like during the day, you parse the file system of your existing image files and make a flattened representation of them in two files. Then, when you start up, you just read the flattened representation, which is a single 300 MB contiguous file on disk that can be read in 2-3 seconds.

So, the first file is called "flat.txt" and it contains a single line for each file, like this but actually 650,000 lines long:

filename:width:height:size
filename:width:height:size
...
filename:width:height:size

The second file is just a binary file with the contents of each of the listed files appended to it - so it is a contiguous 360 MB binary file called "flat.bin".

Here is how I create the two files in Perl, using this script called flattener.pl:

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;

# Names of the index and bin files
my $idxname="flat.txt";
my $binname="flat.bin";

# Open index file, which will have format:
#    fullpath:width:height:size
#    fullpath:width:height:size
open(my $idx,'>',$idxname);

# Open binary file - simply all images concatenated
open(my $bin,'>',$binname);

# Save time we started parsing filesystem
my $atime = my $mtime = time;

find(sub {
  # Only parse actual files (not directories) with extension "png"
  if (-f and /\.png$/) {
    # Get full path filename, filesize in bytes
    my $path   = $File::Find::name;
    my $nbytes = -s;
    # Write name and vital statistics to index file
    print $idx "$path:100:2:$nbytes\n";
    # Slurp entire file and append to binary file
    my $image = do {
       local $/ = undef;
       open my $fh, "<", $path;
       <$fh>;
    };
    print $bin $image;
  }
}, '/path/to/top/directory');

close($idx);
close($bin);

# Set atime and mtime of index file to match time we started parsing
utime($atime, $mtime, $idxname) or warn "Couldn't touch $idxname: $!";

Then, when you want to start up, you run the loader.pl which is like this:

#!/usr/bin/perl
use strict;
use warnings;

# Open index file, which will have format:
#    fullpath:width:height:size
#    fullpath:width:height:size
open(my $idx, '<', 'flat.txt');

# Open binary file - simply all images concatenated
open(my $bin, '<', 'flat.bin');

# Read index file, one line at a time
my $total=0;
my $nfiles=0;
while ( my $line = <$idx> ) {
    # Remove CR or LF from end of line
    chomp $line;

    # Parse line into: filename, width, height and size
    my ($name,$width,$height,$size) = split(":",$line);

    print "Reading file: $name, $width x $height, bytes:$size\n";
    my $bytes_read = read $bin, my $bytes, $size;
    if($bytes_read != $size){
       print "ERROR: File=$name, expected size=$size, actually read=$bytes_read\n"
    }
    $total += $bytes_read;
    $nfiles++;
}
print "Read $nfiles files, and $total bytes\n";

close($idx);
close($bin);

And that takes under 3 seconds with 497,000 files of 600 bytes each.


So, what about files that have been added or changed since you ran the flattener.pl script? Well, at the start of the flattener.pl script, I get the system time in seconds since the epoch. Then, at the end, when I have finished parsing 650,000 files and have written the flattened files out, I set their modification time back to just before I started parsing. Then, in your code, all you need to do is load the files using the loader.pl script, then do a quick find of all image files newer than the index file and load those few extra files using your existing method.

In bash, that would be:

find . -newer flat.txt -print

As you are reading images with OpenCV, you will need to do an imdecode() on the raw file data, so I would benchmark whether you want to do that whilst flattening or whilst loading.


Again, sorry it is in Perl, but I am sure it can be done just the same in Python.
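
For reference, an untested Python sketch of the loader side, reading the same flat.txt/flat.bin pair and doing the imdecode() mentioned above, might look like this:

import cv2
import numpy as np

images = []
with open('flat.txt') as idx, open('flat.bin', 'rb') as bin_file:
    for line in idx:
        # each index line is fullpath:width:height:size
        name, width, height, size = line.rstrip('\n').rsplit(':', 3)
        raw = bin_file.read(int(size))         # the next `size` bytes belong to this file
        if len(raw) != int(size):
            print('ERROR: %s expected %s bytes, got %d' % (name, size, len(raw)))
            break
        im = cv2.imdecode(np.frombuffer(raw, np.uint8), cv2.IMREAD_COLOR)
        if im is not None:
            images.append((name, im.flatten()))

print('Read %d files' % len(images))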

Mark Setchell
  • Yep, something along the lines of what I was thinking. IMHO I'd store the decoded (and perhaps flattened) image data. 200 RGB pixels probably won't have much redundancy for zlib to take advantage of -- with random pixels there's a 70 byte overhead in storage. Then for each read you have the decompression overhead per image -- not really worth it in that case. I'll post my take on this soon, once the benchmarks finish - the single file approach is so damn slow... – Dan Mašek May 31 '17 at 02:02
  • @DanMašek **GNU Parallel** is good for making 650,000 files :-) – Mark Setchell May 31 '17 at 08:10

Did you check that the disk is not the bottleneck? Image files can be cached by the OS after the first read and then served from memory. If all your files together are large enough (10-20 GB), it could take several minutes for a slow HDD to read them.

varela
  • The data is on a platter. I'll work on moving it over to an SSD and see if that helps, though it's very slow to move, probably 48 hours+. I moved it originally from SSD, to a fast thumb drive, to this platter; both in and out of the thumb drive took about the same time. Is there some way for me to specify "keep this cached object persistent through a system restart"? – litepresence May 29 '17 at 14:03
  • You have two options to preserve the memory state: first, read all the image files on startup; second, use suspend instead of restart. But the ultimate solution will be migrating to an SSD. If the total size of the images grows larger than your RAM, you will start reading from disk again and it will be just as slow as the first read. – varela May 29 '17 at 14:15

Have you tried data parallelism on your for j in compare: loop to mitigate the HDD access bottleneck? multiprocessing can be used to run one task per CPU core (or hardware thread). See using-multiprocessing-queue-pool-and-locking for an example.

If you have an Intel i7 with 8 virtual cores, the elapsed time could theoretically drop to 1/8. The actual speed-up will also depend on the access time of your HDD or SSD, the SATA interface type, and so on.
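
A hedged sketch of that idea, handing one subfolder to each worker with multiprocessing.Pool (the read_folder helper is hypothetical and would wrap the question's inner cv2.imread loop):

import os
from multiprocessing import Pool

def read_folder(j):
    # hypothetical helper: run the question's inner loop for subfolder j and
    # return its [j, i, flattened_array] rows
    return []                                  # placeholder body

if __name__ == '__main__':
    compare = sorted(d for d in os.listdir('images_old')
                     if os.path.isdir(os.path.join('images_old', d)))
    pool = Pool()                              # defaults to one worker per CPU core
    images = []
    for rows in pool.map(read_folder, compare):
        images.extend(rows)
    pool.close()
    pool.join()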

thewaywewere