
I have the following block of code:

```python
import numpy as np

def hasCleavage(tags, pair, fragsize):
    # search window: mean fragment size plus four standard deviations
    limit = int(fragsize["mean"] + fragsize["sd"] * 4)
    if pair.direction == "F1R2" or pair.direction == "R2F1":
        x1 = np.where((tags[pair.chr_r1] >= pair.r1["pos"]) & (tags[pair.chr_r1] <= pair.r1["pos"]+limit))[0]
        x2 = np.where((tags[pair.chr_r2] <= pair.r2["pos"]+pair.frside) & (tags[pair.chr_r2] >= pair.r2["pos"]+pair.frside-limit))[0]
    elif pair.direction == "F1F2" or pair.direction == "F2F1":
        x1 = np.where((tags[pair.chr_r1] >= pair.r1["pos"]) & (tags[pair.chr_r1] <= pair.r1["pos"]+limit))[0]
        x2 = np.where((tags[pair.chr_r2] >= pair.r2["pos"]) & (tags[pair.chr_r2] <= pair.r2["pos"]+limit))[0]
    elif pair.direction == "R1R2" or pair.direction == "R2R1":
        x1 = np.where((tags[pair.chr_r1] <= pair.r1["pos"]+pair.frside) & (tags[pair.chr_r1] >= pair.r1["pos"]+pair.frside-limit))[0]
        x2 = np.where((tags[pair.chr_r2] <= pair.r2["pos"]+pair.frside) & (tags[pair.chr_r2] >= pair.r2["pos"]+pair.frside-limit))[0]
    else:  # F2R1 or R1F2
        x1 = np.where((tags[pair.chr_r2] >= pair.r2["pos"]) & (tags[pair.chr_r2] <= pair.r2["pos"]+limit))[0]
        x2 = np.where((tags[pair.chr_r1] <= pair.r1["pos"]+pair.frside) & (tags[pair.chr_r1] >= pair.r1["pos"]+pair.frside-limit))[0]
    # True only if both reads have at least one tag in their window
    return x1.size > 0 and x2.size > 0
```

My script takes 16 minutes to finish. It calls hasCleavage millions of times, once per row of the file it reads. When I add a `return True` right above the `limit` assignment (so `np.where` is never called), the script takes 5 minutes.

`tags` is a dictionary mapping names to NumPy arrays of ascending numbers.

Do you have any suggestions to improve performance?

EDIT:

```python
tags = {'JH584302.1': array([   351,   1408,   2185,   2378,   2740,   2904,   3364,   3657,
         4240,   5324,   5966,   5977,   5986,   6488,   6531,   6847,
         6961,   6973,   6991,   7107,   7383,   7395,   7557,   7569,
         9178,  10077,  10456,  10471,  11271,  11466,  12311,  12441,
        12598,  13051,  13123,  13859,  14167,  14672,  15156,  15252,
        15268,  15273,  15694,  15786,  16361,  17073,  17293,  17454])
}
fragsize = {'sd': 130.29407997430428, 'mean': 247.56636}
```

And `pair` is an object of a custom class: `<__main__.Pair object at 0x17129ad0>`
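Since each array in `tags` is sorted ascending, one idea I am considering (untested on the real data) is replacing `np.where` with a binary search via `np.searchsorted`, which only needs to find the window boundaries instead of building index arrays. A minimal sketch of the window check, using a made-up array:

```python
import numpy as np

def window_has_tag(arr, lo, hi):
    """Return True if the sorted array `arr` contains any value in [lo, hi].

    np.searchsorted is O(log n) per call and allocates no index arrays,
    unlike np.where on a boolean mask.
    """
    left = np.searchsorted(arr, lo, side="left")    # first index with arr[i] >= lo
    right = np.searchsorted(arr, hi, side="right")  # first index with arr[i] > hi
    return right > left

arr = np.array([351, 1408, 2185, 2378, 2740])
print(window_has_tag(arr, 2000, 2400))  # True: 2185 and 2378 fall in the window
```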

Comments:
- You need to give us more information on the inputs to this function. We are too lazy to deduce it just from the code. A small sample would be nice as well. Again we are too lazy to concoct a test case ourselves. – hpaulj Jun 13 '17 at 17:13
- `np.where` constructs a new, potentially huge array. Given that you only need the size of that resulting array, this is probably not what you want; you could try iterating over the array and checking for the condition, or using some other NumPy function instead. – ForceBru Jun 13 '17 at 17:31
- A similar question has been answered here: https://stackoverflow.com/questions/33281957/faster-alternative-to-numpy-where – GuyTal Jul 26 '18 at 00:24
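Building on the comments' point that only the existence of matches is needed: since the per-call overhead dominates, the checks for many rows could also be batched, because `np.searchsorted` accepts arrays of query values. A hypothetical sketch (the per-row window bounds here are made up, not computed from a real `pair`):

```python
import numpy as np

# sorted tag positions for one chromosome
arr = np.array([351, 1408, 2185, 2378, 2740, 2904, 3364])

# hypothetical window bounds, one pair of bounds per file row
los = np.array([300, 2400, 5000])
his = np.array([400, 2700, 6000])

# vectorized binary searches: one result per window, no Python-level loop
left = np.searchsorted(arr, los, side="left")
right = np.searchsorted(arr, his, side="right")
has_tag = right > left  # boolean per window
print(has_tag)          # one True/False per row's window
```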

0 Answers