Optimizing python code - filtering numpy arrays

Question

I have a big number of coordinates in 2 arrays (HLat22 and HLong22) and I've also got a LineString. The output is in indices - there is an array full of True/False that shows me the coordinates in HLat22/HLong22 that are in a certain threshold to the coordinates on my LineString ( my example is 0.005) . On the example that I have, on the 4th position the coordinates are near my LineString.

For the filtering function I've used the function from this post: Selecting close matches from one array based on another reference array

def searchsorted_filter(a, b, thresh):
    choices = np.sort(b) # if b is already sorted, skip it
    lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
    ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
    cl = np.take(choices,lidx) # Or choices[lidx]
    cr = np.take(choices,ridx) # Or choices[ridx]
    return a[np.minimum(np.abs(a - cl), np.abs(a - cr)) < thresh]

from shapely.geometry import LineString, Point, LinearRing
import time
import numpy as np

start_time = time.time()
HLat22 = np.asarray([100,200,300,32.47156,500,600,700,800,900,1000])
HLong22 = np.asarray([-100,-200,-300,-86.79192,-500,-600,-700,-800,-900,-1000])
polygon2 = LineString ([Point(-86.79191,32.47155), Point(-86.78679699999999,32.47005)])

#Getting lat and long coordinates
numpy_x = np.array(polygon2.coords.xy[0])
numpy_y = np.array(polygon2.coords.xy[1])

#Filtering so I only remain with coordinates 
The_X = searchsorted_filter(HLong22,numpy_x,thresh=0.005)
The_Y = searchsorted_filter(HLat22,numpy_y,thresh=0.005)

print("Secsfilter: %s",time.time()-start_time)
start_time = time.time()
indices = np.in1d(HLong22, The_X) & np.in1d(HLat22, The_Y)
print("Secsin1d: %s",time.time()-start_time)

Output:

Secsfilter: %s 0.002005338668823242
Secsin1d: %s 0.0 
array([False, False, False,  True, False, False, False, False, False, False], dtype=bool)

This works fine. However, with bigger outputs it starts going slower. If my HLat2/Hlong2 are 1413917 in size ( same LineString ), this is how it acts:

Secsfilter: %s 0.20999622344970703
Secsin1d: %s 0.49498486518859863

The_X and The_Y's length would be 15249 .

The question that I have is: Is there any way to optimize this code and make it a bit faster?

can you give some explanation of your searchsorted function? — Tai, Jan 06 '18 at 20:19
Where does this task come from? I'm wondering if that's really what you need or already some attempt to speed something up (e.g. get all within distance vs. get all where decomposed single-dim distances are within range)? — sascha, Jan 06 '18 at 20:27
You are right. My task is to get all coordinates within distance of the linestring. After I filter the coordinates so there are only coordinates in 0.005 lat/long degrees to my linestring, I calculate the closest point on the line string to each of my remaining coordinates and then just use haversine on the closest point and my coordinate. Those work really fast though, I have no problems with them. — Barkz, Jan 06 '18 at 20:35
Trying to better optimize a post from Divakar will be hard to do :). — alkasm, Jan 06 '18 at 20:37
@Tai I am also not sure how to explain it properly but I tested it and I'm not sure if I can find something faster. To optimize it I think I might have to change the way my data is being held so I don't use that type of filtering in the same place or if there is anything faster than numpy.in1d ? — Barkz, Jan 06 '18 at 20:40

sascha · Accepted Answer · 2018-01-06T21:11:48.877

Often algorithmics beat low-level optimizations (e.g. binary-search vs. linear-search; the former better for big n; the latter better for small n).

Without much experience with this area and totally ignoring the numbers you gave, here some demo you should try out! You will have to do your own benchmarks tailored for your task (and tune the parameters available)!

The idea is:

use metric-trees which are specialized data-structures for nearest-neighbor-like searches in some metric space
- here: ball-tree (of sklearn) supporting the haversine metric
- supports queries like:
  - get k nearest neighbors
  - get all neighbors within distance <=x

Code:

from sklearn.neighbors import BallTree
import numpy as np

Coords = np.array([[51.165691, 10.451526],  # GER
                   [40.463667, -3.74922],   # ESP
                   [61.52401, 105.318756]]) # RUS
print(Coords)

polygon2 = np.array([[52.520008, 13.404954],   # BERLIN
                     [55.751244, 37.618423]])  # MOSCOW
print(polygon2)

# BUILD TREE for LOOKUP
tree = BallTree(Coords, metric='haversine')

# QUERY NEAREST NEIGHBORS
print('\nnearest neighbor search')
dist, ind = tree.query(polygon2, k=1)
print('dist: ', dist)
print('indices: ', ind)     

# QUERY FOR DISTANCE <= X
print('\nradius search')
ind = tree.query_radius(polygon2[0][np.newaxis], 0.15)
print('indices: ', ind)

Output

[[  51.165691   10.451526]
 [  40.463667   -3.74922 ]
 [  61.52401   105.318756]]
[[ 52.520008  13.404954]
 [ 55.751244  37.618423]]

nearest neighbor search
dist:  [[ 0.11852066]
 [ 0.76816021]]
indices:  [[0]
 [2]]

radius search
indices:  [array([0], dtype=int64)]

Yes, this looks great! I've tested it a bit and seems to be really fast. One quick note: for those trying this method, you need to convert lat/long to radians for the output to be correct in radians . — Barkz, Jan 06 '18 at 22:14

Optimizing python code - filtering numpy arrays

1 Answers1

Code:

Output