
I have on the order of 10^5 binary files which I read one by one in a for loop with numpy's `fromfile` and plot with pyplot's `imshow`. Each file takes about a minute to read and plot.

Is there a way to speed things up?

Here is some pseudo code to explain my situation:

#!/usr/bin/env python

import numpy as np
import matplotlib as mpl
mpl.use('Agg')

import matplotlib.pyplot as plt

nx = 1200 ; ny = 1200

fig, ax = plt.subplots()
ax.set_xlabel('x')
ax.set_ylabel('y')

for f in files:
  data = np.fromfile(open(f,'rb'), dtype=np.float32, count=nx*ny)
  data.resize(nx,ny)
  im = ax.imshow(data)
  fig.savefig(f+'.png', dpi=300, bbox_inches='tight')
  im.remove()

I found the last step to be crucial so that memory does not explode.
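
As an aside on the same memory point, a possible alternative (just a sketch, assuming every file has the same 1200x1200 shape, reusing fig and ax from above, and the same undefined files list) is to create the image artist once and only swap its data on each iteration, instead of creating and removing it every time:

im = ax.imshow(np.zeros((ny, nx), dtype=np.float32))  # create the artist once with a dummy frame

for f in files:
    data = np.fromfile(open(f, 'rb'), dtype=np.float32, count=nx*ny)
    data.resize(nx, ny)
    im.set_data(data)   # swap the pixel data in place
    im.autoscale()      # rescale the color limits to the new frame
    fig.savefig(f + '.png', dpi=300, bbox_inches='tight')

The rendering inside savefig still dominates, so this mostly helps on the memory side rather than the speed side.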

Shahar
  • I feel like it would be faster to use `C` to read the files, then something like python or R for the plotting. – Al.Sal Aug 21 '14 at 17:27
  • I'm guessing the issue is that there are very large files? In which case I don't think there really is. You might be able to parallelize using `multiprocessing`. – Roger Fan Aug 21 '14 at 17:29
  • when you run this you're not actually displaying the image to the screen, right? you're not running this from within `ipython --pylab` or with `plt.ion()` correct? – Ben Aug 21 '14 at 17:30
  • @Al.Sal - `np.fromfile` is effectively identical in speed to doing the same thing in C. The bottleneck here is rendering the image with matplotlib, not reading the data in. – Joe Kington Aug 21 '14 at 18:05
  • @Ben, This code is run with a Canopy python interpreter; I forgot to mention that I call `matplotlib.use('Agg')` before importing pyplot. – Shahar Aug 21 '14 at 18:59
  • @JoeKington, each file is about 5.7 MB (4*1200*1200 bytes). Reading a file takes about 12 seconds; the rest is indeed rendering. My question remains: any ideas how to speed things up? Would it help if I parallelize the code so that processor #1 works on the first 10^3 files, processor #2 on the next 10^3 files, and so on? – Shahar Aug 21 '14 at 19:06
  • @Shahar rendering each 5.7 MB file is probably what's causing the problem, as Joe Kington said earlier. – Tanmaya Meher Aug 21 '14 at 19:08
  • @Shahar - I'm _very_ surprised that it's taking 12 seconds to read in a 5.7 MB file. (For comparison, reading in a 9GB file using the same method takes about 24 seconds on my system.) Is this over a network drive? That seems unusual, anyway... That aside, though, even if there is an I/O bottleneck, `multiprocessing` should help here. Matplotlib will take quite a while to render the image, and there's no reason that can't be done on multiple cores independently. – Joe Kington Aug 21 '14 at 19:08
  • @Shahar Okay, that's fine. Just as long as you're not actually rendering the image to the screen. – Ben Aug 21 '14 at 19:11
  • @JoeKington, the file is local, on an SSD! I agree that 12 seconds is a long time but I'm not sure what to do about it. As far as `multiprocessing` goes, I'll give it a try but I have not done anything like this in python before, so pointers will be much appreciated. – Shahar Aug 21 '14 at 19:16
  • @Shahar - Weird! I'm stumped there... (Are you out of memory and swapping, maybe?) I'll put together an example of using `multiprocessing` for this if someone else doesn't beat me to it. It's not too hard, but it's certainly counter-intuitive the first time you use it. – Joe Kington Aug 21 '14 at 19:23
  • @Shahar I just tried running similar code in Canopy, and it takes SIGNIFICANTLY longer than from the command line (I don't know why). Try running your script from the terminal. – Ben Aug 21 '14 at 19:34
  • @JoeKington, **thanks!** My code takes up 300 MB of real memory with more than 5 GB of free memory so I'm assuming no swapping is being done. – Shahar Aug 21 '14 at 19:35
  • @Ben, I am. My code is a `.py` executable called from the terminal. I wouldn't dream of running this from within Canopy. I mentioned using Canopy's python interpreter as extra information that might have something to do with the slowness I am experiencing. – Shahar Aug 21 '14 at 19:36
  • @Shahar Okay, when you said "This code is run with a Canopy python interpreter" you just mean that you're using the enthought install? But you're running it as `python – Ben Aug 21 '14 at 19:39

2 Answers


As the number of images is very large and you are using imshow, I would suggest a different approach.

  1. create a figure with the desired dimensions and a blank placeholder image (any color will do as long as it is not the same as the spine color)
  2. save the figure to template.png
  3. load template.png by using scipy.ndimage.imread
  4. load the image data into an array
  5. convert your data into colors using a colormap
  6. scale your image to fit the pixel dimensions of the template (scipy.ndimage.zoom)
  7. copy the pixel data into the template
  8. save the resulting image (e.g. with scipy.misc.imsave)
  9. repeat steps 4 - 8 as many times as you need

This will bypass a lot of rendering stuff. Some comments:

  • step 1 may take quite a lot of fiddling (anti-aliasing especially may require attention; it is beneficial to have a sharp black/white border at the edges of the spines)
  • if step 4 is slow (I do not understand why), try numpy.memmap
  • if you can, try to use a color map which can be produced by simple arithmetic operations from the data (for example, grayscale, or grayscale with gamma); then you can make step 5 faster
  • if you can live with images where your data is unscaled (i.e. the area used by the original imshow is 1200x1200), you can get rid of the slow scaling operation (step 6); it also helps if you can downsample by an integer factor
  • if you need to resample the images in step 6, you may also check the functions in the cv2 (OpenCV) module; they may be faster than the more general functions in scipy.ndimage

Performance-wise the slowest operations are 5, 6, and 9. I would expect this approach to be able to handle maybe ten arrays per second. Above that, disk I/O will start to be a limiting factor. If the processing step is the limiting factor, I would just start four copies of the script (assuming there are four cores), each copy having access to a different 2.5 x 10^4 subset of the images. With an SSD this should not cause I/O seek catastrophes.

Only profiling will tell, though.
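
To make the steps more concrete, here is a rough sketch of steps 3 - 8 under the simplifying assumptions above (plain grayscale mapping, step 6 skipped); scipy.ndimage.imread and scipy.misc.imsave are just one choice of reader/writer, and the pixel offsets y0, x0 of the image area inside template.png as well as the files list are placeholders:

import numpy as np
from scipy.ndimage import imread   # step 3
from scipy.misc import imsave      # step 8

nx = ny = 1200

template = imread('template.png')  # pre-rendered figure with a blank image area
y0, x0 = 100, 150                  # pixel offset of the image area inside the template (placeholder values)

for f in files:
    # step 4: load the raw data
    data = np.fromfile(f, dtype=np.float32, count=nx*ny).reshape(ny, nx)

    # step 5: simple arithmetic "colormap": linear mapping to 8-bit grayscale (assumes hi > lo)
    lo, hi = data.min(), data.max()
    gray = ((data - lo) / (hi - lo) * 255).astype(np.uint8)

    # step 7: paste the pixels into a copy of the template (step 6 skipped: no rescaling)
    out = template.copy()
    out[y0:y0+ny, x0:x0+nx, :3] = gray[:, :, np.newaxis]

    # step 8: write the composed frame
    imsave(f + '.png', out)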

DrV

Weird: after a reboot (a solution I don't usually resort to), read time is down to ~0.002 seconds (on average) per file and render time is ~0.02 seconds. Saving the .png file takes ~2.6 seconds, so all in all each frame takes about 2.7 seconds.

I took @DrV's advice,

...I would just start four copies of the script (assuming there are four cores), each copy having access to a different 2.5 x 10^4 subset of the images. With an SSD this should not cause I/O seek catastrophes.

partitioned the files list into 8 sublists and ran 8 instances of my script.
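
(Roughly the same partitioning can also be done inside a single script with multiprocessing. A sketch, assuming a save_frame helper that reads, renders and saves one file along the lines of the loop in the question, and the same placeholder files list:)

import numpy as np
import matplotlib
matplotlib.use('Agg')              # set the backend before importing pyplot
import matplotlib.pyplot as plt
from multiprocessing import Pool

nx = ny = 1200

def save_frame(f):
    # read, render and save a single file (hypothetical helper mirroring the loop in the question)
    data = np.fromfile(open(f, 'rb'), dtype=np.float32, count=nx*ny)
    data.resize(nx, ny)
    fig, ax = plt.subplots()
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.imshow(data)
    fig.savefig(f + '.png', dpi=300, bbox_inches='tight')
    plt.close(fig)                 # free the figure so memory stays bounded
    return f

if __name__ == '__main__':
    pool = Pool(processes=8)       # one worker per sublist, as in the 8-instance approach
    pool.map(save_frame, files)    # 'files' is the same list of binary file names
    pool.close()
    pool.join()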

@DrV's comment

Also, your 0.002 s read time for a 5.7 MB file read does not sound realistic if the file is not in the RAM cache, as it would indicate disk read speed of 2.8 GB/s. (Fast SSDs may just reach 500 MB/s.)

made me benchmark the read/write speeds on my laptop (MacBookPro10,1). I used the following code to produce 1000 files of 1200*1200 random floats (4 bytes each), such that each file is 5.8 MB (1200*1200*4 = 5,760,000 bytes), and then read them one by one, timing the process. The code is run from the terminal and never takes up more than 50 MB of memory (quite a lot for holding only one data array of 5.8 MB in memory, no?).

The code:

#!/usr/bin/env ipython

import os
from time import time
import numpy as np

temp = 'temp'
if not os.path.exists(temp):
    os.makedirs(temp)
    print 'temp dir created'
os.chdir(temp)

nx = ny = 1200
nof = 1000
print '\n*** Writing random data to files ***\n'
t1 = time(); t2 = 0; t3 = 0
for i in range(nof):
    if not i%10:
        print str(i),
    tt = time()
    data = np.array(np.random.rand(nx*ny), dtype=np.float32)
    t2 += time()-tt
    fn = '%d.bin' %i
    tt = time()
    f = open(fn, 'wb')
    f.write(data)
    f.close()
    t3 += time()-tt
print '\n*****************************'
print 'Total time: %f seconds' %(time()-t1)
print '%f seconds (on average) per random data production' %(t2/nof)
print '%f seconds (on average) per file write' %(t3/nof)

print '\n*** Reading random data from files ***\n'
t1 = time(); t3 = 0
for i,fn in enumerate(os.listdir('./')):
    if not i%10:
        print str(i),
    tt = time()
    f = open(fn, 'rb')
    data = np.fromfile(f, dtype=np.float32)  # read back as float32, matching what was written
    f.close()
    t3 += time()-tt
print '\n*****************************'
print 'Total time: %f seconds' %(time()-t1)
print '%f seconds (on average) per file read' %(t3/(i+1))

# clean up:
for f in os.listdir('./'):
    os.remove(f)
os.chdir('../')
os.rmdir(temp)

The result:

temp dir created

*** Writing random data to files ***

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 610 620 630 640 650 660 670 680 690 700 710 720 730 740 750 760 770 780 790 800 810 820 830 840 850 860 870 880 890 900 910 920 930 940 950 960 970 980 990 
*****************************
Total time: 25.569716 seconds
0.017786 seconds (on average) per random data production
0.007727 seconds (on average) per file write

*** Reading random data from files ***

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 610 620 630 640 650 660 670 680 690 700 710 720 730 740 750 760 770 780 790 800 810 820 830 840 850 860 870 880 890 900 910 920 930 940 950 960 970 980 990 
*****************************
Total time: 2.596179 seconds
0.002568 seconds (on average) per file read
Shahar
  • Just one comment: The time taken by `savefig` is actually render time plus png encoding plus saving. The quick rendering time does not actually involve any pixel operation; it just builds the object structure. Also, your 0.002 s read time for a 5.7 MB file read does not sound realistic if the file is not in the RAM cache, as it would indicate disk read speed of 2.8 GB/s. (Fast SSDs may just reach 500 MB/s.) – DrV Aug 23 '14 at 15:02
  • @DrV - I edited my answer and added a small benchmark to test read/write speeds. – Shahar Aug 24 '14 at 16:24
  • Change your 1000 into 10000, then you'll see. If you write 1000 files of 5.76 MB each, all the blocks written onto the disk will be in the operating system page cache, and the reads will be fast. This memory consumption is not visible as application memory consumption. (Another way to approach this is to open your OS X Activity Monitor, take the Disk tab, and then check the "data written" and "data read" numbers.) With a similar machine (16 GiB RAM) and 10000 files I get 0.0133 s per write and 0.0141 s per read. Try it yourself! – DrV Aug 24 '14 at 17:02