25

This might be a silly question, but...

I have several thousand images that I would like to load into Python and then convert into numpy arrays. Obviously this goes a little slowly. But, I am actually only interested in a small portion of each image. (The same portion, just 100x100 pixels in the center of the image.)

Is there any way to load just part of the image to make things go faster?

Here is some sample code where I generate some sample images, save them, and load them back in.

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import time

#Generate sample images
num_images = 5

for i in range(0,num_images):
    Z = np.random.rand(2000,2000)
    print 'saving %i'%i
    plt.imsave('%03i.png'%i,Z)

#load the images
for i in range(0,num_images):
    t = time.time()

    im = Image.open('%03i.png'%i)
    w,h = im.size
    imc = im.crop((w//2-50, h//2-50, w//2+50, h//2+50))  # central 100x100 region

    print 'Time to open: %.4f seconds'%(time.time()-t)

    #convert them to numpy arrays
    data = np.array(imc)
DanHickstein
  • I'm pretty sure you can't, but I would love to be proved wrong on this one – Joran Beasley Oct 30 '13 at 22:52
  • 1
  • You would have to open the file as a raw binary file and then use file.seek() etc. to access the bits you want – avrono Oct 30 '13 at 22:55
  • 1
  • @avrono yeah, but the question is really then how to tell which bits make up the center of an image (regardless of image dimensions) for at least one image type – Joran Beasley Oct 30 '13 at 23:10
  • 3
  • It's even more complicated to find the specific bytes, since it looks like he's using PNGs, which are zlib compressed. – kalhartt Oct 30 '13 at 23:11
  • 1
  • Is PNG a bitmap? I don't think so; it is compressed, so you will have to do something before you get at the bits. Could your images be bitmaps? – vish Oct 31 '13 at 14:44
  • 1
  • I could potentially save the data to another format: tif, bmp, etc., though I would not like to do anything that involves lossy compression (jpeg), since these images are recordings of our experimental data, and I don't want to throw away any information. – DanHickstein Oct 31 '13 at 17:51
  • @DanHickstein: If you're willing to save all the data to .bmp then it would be really easy to do what you ask. The bmp file format is quite straightforward. You'd just have to do some simple calculations to figure out where the data you want is, and then seek to and read only those bytes – Claudiu Nov 01 '13 at 18:24
  • @Claudiu, yes, I could use bmp instead. I'm still not sure how to implement reading only the portion that I want. If you have time, can you post it as an answer? – DanHickstein Nov 04 '13 at 15:41
  • @DanHickstein: Sure, I'll at least post a rough sketch of how to go about it, as I don't have time to implement it fully now. Maybe I will flesh it out later – Claudiu Nov 04 '13 at 16:05

4 Answers

10

Save your files as uncompressed 24-bit BMPs. These store pixel data in a very regular way. Check out the "Image Data" portion of this diagram from Wikipedia. Note that most of the complexity in the diagram is just from the headers:

BMP file format
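
For instance, with Pillow the sample arrays from the question could be written out along these lines (just a sketch: it scales the random floats to 8-bit and promotes them to RGB, since RGB-mode images are saved as uncompressed 24-bit BMPs; the '000.bmp' filename is only for illustration):

import numpy as np
from PIL import Image

# Scale the floats to 0-255, wrap them as an 8-bit image, and promote to RGB;
# Pillow writes RGB-mode images as uncompressed 24-bit BMPs.
Z = np.random.rand(2000, 2000)
im = Image.fromarray((Z * 255).astype(np.uint8)).convert('RGB')
im.save('000.bmp')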

For example, let's say you are storing this image (here shown zoomed in):

2x2 square image

This is what the pixel data section looks like, if it's stored as a 24-bit uncompressed BMP. Note that the data is stored bottom-up, for some reason, and in BGR form instead of RGB, so the first line in the file is the bottom-most line of the image, the second line is the second-bottom-most, etc:

00 00 FF    FF FF FF    00 00
FF 00 00    00 FF 00    00 00

That data is explained as follows:

           |  First column  |  Second Column  |  Padding
-----------+----------------+-----------------+-----------
Second Row |  00 00 FF      |  FF FF FF       |  00 00
-----------+----------------+-----------------+-----------
First Row  |  FF 00 00      |  00 FF 00       |  00 00
-----------+----------------+-----------------+-----------

or:

           |  First column  |  Second Column  |  Padding
-----------+----------------+-----------------+-----------
Second Row |  red           |  white          |  00 00
-----------+----------------+-----------------+-----------
First Row  |  blue          |  green          |  00 00
-----------+----------------+-----------------+-----------

The padding is there to pad the row size to a multiple of 4 bytes.


So, all you have to do is implement a reader for this particular file format, and then calculate the byte offset of where you have to start and stop reading each row:

def calc_bytes_per_row(width, bytes_per_pixel):
    res = width * bytes_per_pixel
    if res % 4 != 0:
        res += 4 - res % 4
    return res

def calc_row_offsets(pixel_array_offset, bmp_width, bmp_height, x, y, row_width):
    if x + row_width > bmp_width:
        raise ValueError("This is only for calculating offsets within a row")

    bytes_per_row = calc_bytes_per_row(bmp_width, 3)
    whole_row_offset = pixel_array_offset + bytes_per_row * (bmp_height - y - 1)
    start_row_offset = whole_row_offset + x * 3
    end_row_offset = start_row_offset + row_width * 3
    return (start_row_offset, end_row_offset)

Then you just have to process the proper byte offsets. For example, say you want to read a 400x400 chunk starting at position (500, 500) in a 10000x10000 bitmap:

def process_row_bytes(row_bytes):
    ... some efficient way to process the bytes ...

bmpf = open(..., "rb")
pixel_array_offset = ... extract from bmp header ...
bmp_width = 10000
bmp_height = 10000
start_x = 500
start_y = 500
end_x = 500 + 400
end_y = 500 + 400

for cur_y in xrange(start_y, end_y):
    start, end = calc_row_offsets(pixel_array_offset, 
                                  bmp_width, bmp_height, 
                                  start_x, cur_y, 
                                  end_x - start_x)
    bmpf.seek(start)
    cur_row_bytes = bmpf.read(end - start)
    process_row_bytes(cur_row_bytes)

Note that how you process the bytes matters. You can probably do something clever with PIL, dumping the pixel data straight into it, but I'm not entirely sure. If you do it in an inefficient manner then it might not be worth it. If speed is a huge concern, you might consider writing it with Pyrex/Cython or implementing the above in C and just calling it from Python.
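
To flesh out the missing pieces a little: in an uncompressed 24-bit BMP, the 4-byte little-endian value at byte offset 10 of the header is the pixel-array offset, and the cropped rows can be collected straight into a numpy array. A rough, untested sketch along those lines, reusing calc_row_offsets from above (the name read_crop_bmp24 is made up for illustration):

import struct
import numpy as np

def read_crop_bmp24(path, x, y, width, height, bmp_width, bmp_height):
    # Read a width x height crop from an uncompressed 24-bit BMP.
    # Rows are stored bottom-up in BGR order; calc_row_offsets handles
    # the bottom-up part, and the channel flip below handles BGR -> RGB.
    rows = []
    with open(path, "rb") as bmpf:
        bmpf.seek(10)
        pixel_array_offset = struct.unpack("<I", bmpf.read(4))[0]
        for cur_y in range(y, y + height):
            start, end = calc_row_offsets(pixel_array_offset,
                                          bmp_width, bmp_height,
                                          x, cur_y, width)
            bmpf.seek(start)
            row_bytes = bmpf.read(end - start)
            row = np.frombuffer(row_bytes, dtype=np.uint8).reshape(width, 3)
            rows.append(row[:, ::-1])  # BGR -> RGB
    return np.array(rows)  # shape (height, width, 3)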

Claudiu
  • Oh, this looks very interesting. I will need to look at this more tonight and see if I can get it to work. Thanks for such a thorough response! – DanHickstein Nov 04 '13 at 20:39
9

While you can't get much faster than PIL crop in a single thread, you can use multiple cores to speed up everything! :)

I ran the code below on my 8-core i7 machine as well as my seven-year-old, two-core, barely-2 GHz laptop. Both saw significant improvements in run time. As you would expect, the improvement depended on the number of cores available.

The core of your code is the same; I just separated the looping from the actual computation so that the function could be applied to a list of values in parallel.

So, this:

for i in range(0,num_images):
    t = time.time()

    im = Image.open('%03i.png'%i)
    w,h = im.size
    imc = im.crop((w//2-50, h//2-50, w//2+50, h//2+50))

    print 'Time to open: %.4f seconds'%(time.time()-t)

    #convert them to numpy arrays
    data = np.array(imc)

Became:

def convert(filename):  
    im = Image.open(filename)
    w,h = im.size
    imc = im.crop((w//2-50, h//2-50, w//2+50, h//2+50))
    return numpy.array(imc)

The key to the speedup is the Pool feature of the multiprocessing library. It makes it trivial to run things across multiple processors.

Full code:

import os 
import time
import numpy 
from PIL import Image
from multiprocessing import Pool 

# Path to where my test images are stored
img_folder = os.path.join(os.getcwd(), 'test_images')

# Collects all of the filenames for the images
# I want to process
images = [os.path.join(img_folder,f) 
        for f in os.listdir(img_folder)
        if '.jpeg' in f]

# Your code, but wrapped up in a function       
def convert(filename):  
    im = Image.open(filename)
    w,h = im.size
    imc = im.crop((w//2-50, h//2-50, w//2+50, h//2+50))
    return numpy.array(imc)

def main():
    # This is the hero of the code. It creates pool of 
    # worker processes across which you can "map" a function
    pool = Pool()

    t = time.time()
    # We run it normally (single core) first
    np_arrays = map(convert, images)
    print 'Time to open %i images in single thread: %.4f seconds'%(len(images), time.time()-t)

    t = time.time()
    # now we run the same thing, but this time leveraging the worker pool.
    np_arrays = pool.map(convert, images)
    print 'Time to open %i images with multiple threads: %.4f seconds'%(len(images), time.time()-t)

if __name__ == '__main__':
    main()

Pretty basic. Only a few extra lines of code, and a little refactoring to move the conversion bit into its own function. The results speak for themselves:

Results:

8-Core i7

Time to open 858 images in single thread: 6.0040 seconds
Time to open 858 images with multiple threads: 1.4800 seconds

2-Core Intel Duo

Time to open 858 images in single thread: 8.7640 seconds
Time to open 858 images with multiple threads: 4.6440 seconds

So there ya go! Even if you have a super old 2 core machine you can halve the time you spend opening and processing your images.

Caveats

Memory. If you're processing thousands of images, you're probably going to hit your machine's memory limit at some point. To get around this, you'll just have to process the data in chunks. You can still leverage all of the multiprocessing goodness, just in smaller bites. Something like:

for i in range(0, len(images), chunk_size): 
    results = pool.map(convert, images[i : i+chunk_size]) 
    # rest of code. 
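
Another option when memory is the main concern is Pool.imap, which yields results lazily as the workers finish instead of building one big list. A sketch under the same setup as above (process is just a stand-in for whatever you do with each array):

# Inside main(), instead of pool.map(...):
# chunksize controls how many filenames are handed to each worker at a time.
for arr in pool.imap(convert, images, chunksize=32):
    process(arr)  # 'process' is a hypothetical per-image step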
Audionautics
  • Oh, this is really interesting. I thought that I would be limited by the disk read rate, but this makes it seem like I am actually limited by the cropping function? Very good practical way to achieve a nice speedup. Thanks! – DanHickstein Nov 04 '13 at 15:44
  • 1
  • @DanHickstein Measure everything :) I have a very similar script to yours at work (e.g. open/crop/process). I thought for sure the bottleneck was actually loading the images off the disk (given that there are tens of thousands of them). However, after a quick run of line_profiler with kernprof, I realized that images can be read absurdly fast off disk (at least on my setup). That said, if you find read IO is still a problem, you can use `multiprocessing.dummy.Pool` to easily split the IO across multiple threads. – Audionautics Nov 04 '13 at 17:09
  • Note for Python3: map is lazy [Python: Map calling a function not working](https://stackoverflow.com/a/19342377/3972710) so the first mono-threaded call does not give any results – NGI Dec 09 '21 at 20:45
4

Oh I just realized there might be a far, far simpler way than doing what I wrote above regarding the BMP files.

If you are generating the image files anyway, and you always know which portion you want to read, simply save that portion out as another image file while you're generating it:

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

#Generate sample images
num_images = 5

for i in range(0,num_images):
    Z = np.random.rand(2000, 2000)
    plt.imsave('%03i.png'%i, Z)
    snipZ = Z[200:300, 200:300]
    plt.imsave('%03i.snip.png'%i, snipZ)

#load the images
for i in range(0,num_images):
    im = Image.open('%03i.snip.png'%i)

    #convert them to numpy arrays
    data = np.array(im)
Claudiu
  • Ah, yes, this is a good sanity check and would work in some circumstances. But, more generally I want to save the full-size images and then define the region-of-interest later. So, the region of interest will be the same for all of the images in a set when I am reading them in, but I cannot easily define the ROI at the time of saving the images. – DanHickstein Nov 04 '13 at 20:36
1

I have run some timing tests, and I am sorry to say I don't think you can get much faster than the PIL crop command. Even with manual seeking and low-level reading, you still have to read the bytes. Here are the timing results:

%timeit im.crop((1000-50,1000-50,1000+50,1000+50))
fid = open('003.png','rb')
%timeit fid.seek(1000000)
%timeit fid.read(1)
print('333*100*100/10**(9)*1000=%.2f ms'%(333*100*100/10**(9)*1000))


100000 loops, best of 3: 3.71 us per loop
1000000 loops, best of 3: 562 ns per loop
1000000 loops, best of 3: 330 ns per loop
333*100*100/10**(9)*1000=3.33 ms

As the last calculation shows, reading the 10,000 bytes of a 100x100 sub-image one byte at a time, at ~333 ns per read, works out to about 3.33 ms, which is about the same as the crop command above.
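
For reference, the same %timeit approach can also be pointed at the full open/crop/convert pipeline, which forces the PNG to actually be decoded (per the comment below, crop on its own can return lazily). A sketch, assuming the same 003.png and that Image and np are already imported in the session:

%timeit np.array(Image.open('003.png').crop((1000-50,1000-50,1000+50,1000+50)))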

Paul
  • Okay, good to have some independent confirmation that I am already at the local maximum for speediness. I will accept this answer in a few days if there is no other solution. Thanks! – DanHickstein Oct 31 '13 at 14:36
  • 3
  • -1, this is not a good comparison. At the `im.crop` point, the image is already loaded. `im.crop` just [returns a proxy object](http://hg.effbot.org/pil-117/src/770a9717f387cb878fc44b0e9be88e677b9abcd3/PIL/Image.py?at=default#cl-734) - it's practically a no-op. A fair comparison would be loading the whole image & then cropping & then converting to an array, vs. reading just the relevant bytes & converting those into an array. – Claudiu Nov 04 '13 at 16:50