numpy vectorizing a function slows it down?

Question

I'm writing a program in which I'm trying to see how well a given redshift gets a set of lines detected in an spectrum to match up to an atomic line database. The closer the redshift gets the lines to overlap, the lower the "score" and the higher the chance that the redshift is correct.

I do this by looping over a range of possible redshifts, calculating the score for each. Within that outer loop, I was looping within each line in the set of detected lines to calculate its sub_score, and summing that inner loop to get the overall score.

I tried to vectorize the inner loop with numpy, but surprisingly it actually slowed down the execution. In the example given, the nested for loop takes ~2.6 seconds on my laptop to execute, while the single for loop with numpy on the inside takes ~5.3 seconds.

Why would vectorizing the inner loop slow things down? Is there a better way to do this that I'm missing?

import numpy as np
import time

def find_nearest_line(lines, energies):
    # Return the indices (from the lines vector) that are closest to the energies vector supplied
    # Vectorized with help from https://stackoverflow.com/a/53111759
    energies = np.expand_dims(energies, axis=-1)
    idx = np.abs(lines / energies - 1).argmin(axis=-1)
    return idx

def calculate_distance_to_line(lines, energies):
    # Returns distance between an array of lines and an array of energies)
    z = (lines / energies) - 1
    return z

rng = np.random.default_rng(2021)
det_lines = rng.random(1000)
atom_lines = rng.random(10000)

redshifts = np.linspace(-0.1, 0.1, 100)

# loop version
start=time.time()
scores = []
for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    score = 0
    for det_line in det_lines:
        closest_atom_line = find_nearest_line(atom_lines_shifted, det_line)
        score += abs(calculate_distance_to_line(atom_lines_shifted[closest_atom_line], det_line))
    scores.append(score)
print(time.time()-start)
print(scores)

# (semi)-vectorized version
start=time.time()
scores = []
for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    closest_atom_lines = find_nearest_line(atom_lines_shifted, det_lines)
    score = np.sum(np.abs(calculate_distance_to_line(atom_lines_shifted[closest_atom_lines], det_lines)))
    scores.append(score)
print(time.time()-start)
print(scores)

Yes, sometimes a modest number of iterations is faster than one call that uses larger arrays. Vectorized calculations can create large intermediate arrays. Memory management of those arrays can eat into the time savings from just one call. For cases like this I like to do some timings on smaller examples, as well as on large ones. — hpaulj, Dec 31 '21 at 21:06

Jérôme Richard · Accepted Answer · 2022-01-01T10:42:08.510

Numpy codes generally creates many temporary arrays. This is the case for your function find_nearest_line for example. Working on all the items of det_lines simultaneously would results in the creation of many relatively big arrays (1000 * 10_000 * 8 = 76 MiB per array). The thing is big array often do not fit in CPU caches. If so, the array needs to be stored in RAM with a much lower throughput and much higher latency. Moreover, allocating/freeing bigger array takes more time and results often in more page faults (due to the actual implementation of most default standard allocators). It is sometimes faster to use big array because the overhead of the CPython interpreter is huge but both strategies are inefficient in practice.

The thing is that the algorithm is not efficient. Indeed, you can sort the array and use a binary search to find the closest value much more efficiently. np.searchsorted does most of the work but it only returns the index of the closest value greater (or equal) than the target value. Thus, there is some additional operation to do to get the closest value (possibly greater or lesser than the target value). Note that this algorithm do not generate huge array thanks to the binary search.

scores = []
n = atom_lines.size
m = det_lines.size
line_idx = np.arange(m)

for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    sorted_atom_lines_shifted = np.sort(atom_lines_shifted)
    close_idx = np.searchsorted(sorted_atom_lines_shifted, det_lines)
    lower_bound = sorted_atom_lines_shifted[np.maximum(close_idx - 1, 0)]
    upper_bound = sorted_atom_lines_shifted[np.minimum(close_idx, n - 1)]
    bounds = np.hstack((lower_bound[:, None], upper_bound[:, None]))
    closest_bound_idx = find_nearest_line(bounds, det_lines)
    close_values = bounds[line_idx, closest_bound_idx]
    score = np.sum(np.abs(calculate_distance_to_line(close_values, det_lines)))
    scores.append(score)

Since atom_lines is not modified and the multiplication preserve the order, the algorithm can be further optimized by sorting atom_lines directly:

scores = []
n = atom_lines.size
m = det_lines.size
line_idx = np.arange(m)
sorted_atom_lines = np.sort(atom_lines)

for redshift in redshifts:
    sorted_atom_lines_shifted = sorted_atom_lines * (1 + redshift)
    close_idx = np.searchsorted(sorted_atom_lines_shifted, det_lines)
    lower_bound = sorted_atom_lines_shifted[np.maximum(close_idx - 1, 0)]
    upper_bound = sorted_atom_lines_shifted[np.minimum(close_idx, n - 1)]
    bounds = np.hstack((lower_bound[:, None], upper_bound[:, None]))
    closest_bound_idx = find_nearest_line(bounds, det_lines)
    close_values = bounds[line_idx, closest_bound_idx]
    score = np.sum(np.abs(calculate_distance_to_line(close_values, det_lines)))
    scores.append(score)

This last implementation is about 300 times faster on my machine.

One can also use the answer proposed in https://stackoverflow.com/a/46184652 to use the searchsorted() algorithm in the manner you proposed. Regardless, this has a huge speedup on my machine as well. — efremdan1, Jan 02 '22 at 00:38

numpy vectorizing a function slows it down?

1 Answers1