
So I have this program that is looping through about 2000+ data files, performing a Fourier transform on each, plotting the transform, then saving the figure. It feels like the program gets slower the longer it runs. Is there any way to make it run faster or cleaner with a simple change in the code below?

Previously, I had the Fourier transform defined as a function, but I read somewhere here that Python has high function-calling overhead, so I did away with the function and am running straight through now. I also read that clf() keeps a log of previous figures that can get quite large and slow things down if you loop through a lot of plots, so I've changed that to close(). Were these good changes as well?

from numpy import *
from pylab import *

for filename in filelist:

    t,f = loadtxt(filename, unpack=True)

    dt = t[1]-t[0]
    fou = absolute(fft.fft(f))
    frq = absolute(fft.fftfreq(len(t),dt))

    ymax = median(fou)*30

    figure(figsize=(15,7))
    plot(frq,fou,'k')

    xlim(0,400)
    ylim(0,ymax)

    iname = filename.replace('.dat','.png')
    savefig(iname,dpi=80)
    close()
Tim
  • In a case like this, Python's overhead is going to be completely insignificant - the vast majority of your time will be spent in numpy/pylab calls - which are going to be delegating to efficient lower-level code - it seems likely this is just a case of expensive operations you are performing. – Gareth Latty May 09 '14 at 19:55
  • This isn't directly related to your question, but in general doing `from module import *` isn't a good idea. It makes it hard to tell where the functions you're calling are defined, and it could potentially cause conflicts between function names that you won't know about. – dano May 09 '14 at 19:55
  • @Lattyware Yeah, the files are all of varying size, from a couple Mb, to tens of Mb, so the fourier transform can be quite intensive. I just wasn't sure if changing some plotting functions around would help. – Tim May 09 '14 at 20:06
  • @dano Oh yeah, don't worry I didn't do that in my code. It was just easier to type that up than to write out every function i imported :) – Tim May 09 '14 at 20:07
  • and `pylab` is a bad idea, import from `numpy` and `matplotlib.pyplot` directly (and `pyplot` is not a great idea for scripts, use the OO interface directly). – tacaswell May 09 '14 at 20:52

3 Answers


Have you considered using the multiprocessing module to parallelize processing the files? Assuming that you're actually CPU-bound here (meaning it's the Fourier transform that's eating up most of the running time, not reading/writing the files), that should speed up execution time without actually needing to speed up the loop itself.

Edit:

For example, something like this (untested, but should give you the idea):

import multiprocessing

# Assumes the same numpy/pylab imports as in the question.
def do_transformation(filename):
    t,f = loadtxt(filename, unpack=True)

    dt = t[1]-t[0]
    fou = absolute(fft.fft(f))
    frq = absolute(fft.fftfreq(len(t),dt))

    ymax = median(fou)*30

    figure(figsize=(15,7))
    plot(frq,fou,'k')

    xlim(0,400)
    ylim(0,ymax)

    iname = filename.replace('.dat','.png')
    savefig(iname,dpi=80)
    close()

pool = multiprocessing.Pool(multiprocessing.cpu_count())
for filename in filelist:
    pool.apply_async(do_transformation, (filename,))
pool.close()
pool.join()

You may need to tweak what work actually gets done in the worker processes. Trying to parallelize the disk I/O portions may not help you much (or even hurt you), for example.
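One caveat about the sketch above: `apply_async` used in this fire-and-forget way will quietly swallow any exception raised inside a worker. If you want errors to surface, you could hold on to the `AsyncResult` objects and call `get()` on them, something like this (also untested):

results = [pool.apply_async(do_transformation, (filename,)) for filename in filelist]
pool.close()
pool.join()
for r in results:
    r.get()  # re-raises any exception that occurred in the worker process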

dano
  • Hmmm, could you elaborate a little more? I'm in a bit of a time crunch with this program. At the rate I'm going, it looks like another day or two before finishing, and I'm really shooting for 12-18 hours. – Tim May 09 '14 at 20:10
  • I just selected your answer, since it's effectively sped up the program almost 8x (the number of CPUs). But if you could, I have another question. There are a handful of files here that are quite substantial, taking quite a long time to process. Is there a way to assign multiple processors to the same task instead of applying them to separate files? – Tim May 10 '14 at 02:51
  • There's no simple tweak to say "Throw more CPUs at this task". You'd need to refactor the code to break your worker method up into smaller pieces that multiple processes can work on at the same time, and then pull it back together once all the pieces are ready. For example, it looks like `fou = absolute(...` and `frq = absolute(...` could be calculated in parallel. You have to be careful, though, because passing large amounts of data between processes can be slow. It's hard for me to say exactly what kind of changes you could make because I really don't understand the algorithms you're using. – dano May 10 '14 at 03:55

Yes, adding close was a good move. It should help plug the memory leak you had. I'd also recommend moving the figure, plotting, and close commands outside the loop - just update the Line2D instance created by plot. Check out this for more info.

Note: I think this should work, but I haven't tested it here.
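As a rough sketch of that idea (untested, written with the object-oriented interface that tacaswell recommends in the comments; `filelist` and the .dat naming are carried over from the question):

import numpy as np
import matplotlib.pyplot as plt

# Create one figure and one (initially empty) line before the loop.
fig, ax = plt.subplots(figsize=(15, 7))
line, = ax.plot([], [], 'k')
ax.set_xlim(0, 400)

for filename in filelist:
    t, f = np.loadtxt(filename, unpack=True)

    dt = t[1] - t[0]
    fou = np.absolute(np.fft.fft(f))
    frq = np.absolute(np.fft.fftfreq(len(t), dt))

    # Update the existing Line2D instead of building a new figure each pass.
    line.set_data(frq, fou)
    ax.set_ylim(0, np.median(fou) * 30)

    fig.savefig(filename.replace('.dat', '.png'), dpi=80)

plt.close(fig)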

AMacK
  • Does this method save a log of past plots like `clf()` does? – Tim May 09 '14 at 20:11
  • I think I meant the memory leak you referenced. I just skimmed over something previously and thought I read that clearing a plot saves the history (or something of that nature) of that plot which led to the memory leak, but I'm not quite sure. – Tim May 09 '14 at 20:20
  • Ah, ok. Take a look at [this](http://stackoverflow.com/a/8862575/2457474). Essentially, clf() doesn't remove internal references to the figure so it can't go away - i.e., you create ~2K figures AND keep them in memory. The way I suggested just creates one figure and one Line2D object. You update the line points as you go. – AMacK May 09 '14 at 20:28
  • That makes sense. So I'll cut out the time on producing a new figure each go around. Quick question. My program just crashed due to an overflow error. Too many data points I suppose. How would I go about preventing this without changing the number of data points plotted? – Tim May 09 '14 at 21:32
  • Actually, just kidding. I think I know how to get around this problem. – Tim May 09 '14 at 21:34

I tested something similar to what you are doing in IPython and I noticed that the loop got considerably slower when a directory had a lot of files in it. It seems like the file system has overhead that grows with the number of files in a folder, maybe related to the lookup time of:

loadtxt(filename, unpack=True)

You could try splitting your filelist into smaller chunks and saving the plots for each chunk into a different directory.
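A rough sketch of what that could look like (untested; the chunk size and directory naming are arbitrary, and filelist is assumed from the question):

import os
from numpy import loadtxt, absolute, median, fft
from matplotlib.pyplot import figure, plot, xlim, ylim, savefig, close

chunk_size = 200  # arbitrary; pick whatever keeps each directory small

for i in range(0, len(filelist), chunk_size):
    chunk = filelist[i:i + chunk_size]

    # One output directory per chunk so no single directory gets too crowded.
    outdir = 'plots_%03d' % (i // chunk_size)
    if not os.path.isdir(outdir):
        os.makedirs(outdir)

    for filename in chunk:
        t, f = loadtxt(filename, unpack=True)

        dt = t[1] - t[0]
        fou = absolute(fft.fft(f))
        frq = absolute(fft.fftfreq(len(t), dt))

        figure(figsize=(15, 7))
        plot(frq, fou, 'k')
        xlim(0, 400)
        ylim(0, median(fou) * 30)

        iname = os.path.join(outdir, os.path.basename(filename).replace('.dat', '.png'))
        savefig(iname, dpi=80)
        close()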

Elias
  • Yeah, I think I'll definitely remember that for future reference, but right now I'm hard pressed for time, so I can't make the change. :( – Tim May 09 '14 at 20:17