
I think this code takes too long to execute, so there may be better ways to do it. I'm not looking for an answer that involves parallelising the for loops or using more than one processor.

What I'm trying to do is read values from a file using np.genfromtxt(file). There are 209*500*16 of these files. For each of the 16 groups and each of the 500 values of z, I want to extract the minimum of the highest 1000 values gathered over the 209-file loop, and write the resulting 500 values per group into 16 different files (one per group). If a file is missing or the data does not have the expected size, that information is written to the "missing_all" file.

The questions are:

  1. Is this the best method to open a file?

  2. Is this the best method to write to files?

  3. How can I make this code faster?

Code:

import numpy as np
import os.path

output_filename2 = '/home/missing_all.txt' 
target2          = open(output_filename2, 'w')

for w in range(16):
    group           = 1200 + 50*w
    output_filename = '/home/veto_%s.txt' %(group)
    target          = open(output_filename, 'w')
    for z in range(1,501):
        sig_b = np.zeros((209*300))
        y     = 0
        for index in range(1,210):
            file                 = '/home/BandNo_%s_%s/%s_209.dat' %(group,z,index)
            if not os.path.isfile(file):
                sig_b[y:y+300]   = 0
                y                = y + 300
                target2.write('%s %s %s\n' % (group,z,index))
                continue
            data                 = np.genfromtxt(file)
            if (data.shape[0] < 300):
                sig_b[y:y+300]   = 0
                y                = y + 300
                target2.write('%s %s %s\n' % (group,z,index))
                continue
            sig_b[y:y+300]       = np.sort(data[:,4])[::-1][0:300]
            y                    = y + 300  
        sig_b          = np.sort(sig_b[:])[::-1][0:1000]   
        target.write('%s\n' % (sig_b[-1]))

1 Answer

Profiler

You can use a profiler to figure out which parts of your script take the most time. That way you know exactly where the time goes and can optimize those lines instead of blindly optimizing your code. The time invested in learning how the profiler works will easily pay for itself later on.
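
For example, Python's built-in cProfile module needs very little setup: you can run the unchanged script with python -m cProfile -s cumulative your_script.py, or profile from inside the script. A minimal sketch, assuming the loops are wrapped in a function (main() and the stats filename profile_stats are placeholders):

    import cProfile
    import pstats

    def main():
        # the nested loops from the question would go here
        pass

    if __name__ == '__main__':
        cProfile.run('main()', 'profile_stats')            # write raw timing data to a file
        stats = pstats.Stats('profile_stats')
        stats.sort_stats('cumulative').print_stats(20)     # show the 20 most expensive calls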

Some possible slow-downs

Here are some guesses, but they really are only guesses.

  1. You open() only 17 files, so it probably doesn't matter how exactly you do this.

  2. I don't know much about write performance. Using file.write() seems fine to me.

  3. genfromtxt probably takes quite a while (it depends on your input files); is loadtxt an alternative for you? The docs state that you can use it for data without missing values.

  4. Using a binary file format instead of text could speed up reading the file (see the .npy sketch after this list).

  5. You sort your array on every iteration. Is there a way to sort it only at the end? (See the partial-selection sketch after this list.)

  6. Usually asking the file system something is not very fast, i.e. os.path.isfile(file) is potentially slow. You could cache the children of the parent directory (e.g. in a set) and check against that instead (see the directory-listing sketch after this list).

  7. Similarly, if most of your files exist, using exceptions can be faster:

    try:
        data = np.genfromtxt(file)
    except OSError:  # covers FileNotFoundError and the IOError genfromtxt may raise
        sig_b[y:y+300] = 0
        y += 300
        target2.write('%s %s %s\n' % (group, z, index))
        continue
    
  8. I didn't try to understand your code in detail. Maybe you can reduce the necessary work by using a smarter algorithm?
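
Regarding point 4, here is a minimal sketch of a one-off conversion to NumPy's binary .npy format (the paths follow the question's naming pattern and are only examples; the conversion only pays off if each file is read more than once):

    import numpy as np

    text_file = '/home/BandNo_1200_1/1_209.dat'
    binary_file = '/home/BandNo_1200_1/1_209.npy'

    # One-off conversion: parse the text file once and store it in binary form.
    data = np.genfromtxt(text_file)
    np.save(binary_file, data)

    # In the main loop, reading the binary file back is much cheaper than re-parsing text.
    data = np.load(binary_file)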
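
Regarding point 5, note that only the smallest of the top 1000 values is ever written out, so the full descending sorts are not strictly needed; np.partition can select the k largest elements without ordering them. A small self-contained sketch of the equivalence:

    import numpy as np

    rng = np.random.default_rng(0)
    values = rng.normal(size=5000)

    # Full descending sort, as in the question:
    threshold_sorted = np.sort(values)[::-1][0:1000][-1]

    # Partial selection: the 1000 largest values (unordered), then their minimum.
    threshold_partitioned = np.partition(values, -1000)[-1000:].min()

    assert threshold_sorted == threshold_partitioned

The same replacement works for the per-file selection, e.g. np.partition(data[:, 4], -300)[-300:] instead of np.sort(data[:, 4])[::-1][0:300].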
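
Regarding point 6, a minimal sketch of the cached directory listing, here kept in a set built with os.listdir (the path follows the question's naming pattern and is only an example):

    import os

    directory = '/home/BandNo_1200_1'              # one (group, z) directory from the question
    try:
        existing = set(os.listdir(directory))      # one file-system call instead of 209 isfile() checks
    except FileNotFoundError:                      # the whole directory may be missing
        existing = set()

    for index in range(1, 210):
        filename = '%s_209.dat' % index
        if filename not in existing:               # cheap in-memory membership test
            # log to missing_all and skip, as in the question
            continue
        # data = np.genfromtxt(os.path.join(directory, filename))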

PS: I like that you try to put all the equals signs in the same column. Unfortunately, here it makes your code harder to read.
