Find closest line to each point on big dataset, possibly using shapely and rtree

Question

I have a simplified map of a city that has streets in it as linestrings and addresses as points. I need to find closest path from each point to any street line. I have a working script that does this, but it runs in polynomial time as it has nested for loop. For 150 000 lines (shapely LineString) and 10 000 points (shapely Point), it takes 10 hours to finish on 8 GB Ram computer.

The function looks like this (sorry for not making it entirely reproducible):

import pandas as pd
import shapely
from shapely import Point, LineString

def connect_nodes_to_closest_edges(edges_df , nodes_df,
                                   edges_geom,
                                   nodes_geom):
    """Finds closest line to points and returns 2 dataframes:
        edges_df
        nodes_df
    """
    for i in range(len(nodes_df)):
        point = nodes_df.loc[i,nodes_geom]
        shortest_distance = 100000
        for j in range(len(edges_df)):
            line = edges_df.loc[j,edges_geom]
            if line.distance(point) < shortest_distance:
                shortest_distance = line.distance(point)
                closest_street_index = j
                closest_line = line
                ...

Then I save the results in a table as a new column that adds the shortest path from point to line as a new column.

Is there a way to make it faster with some addition to the function?

If I could for example filter out lines for every point that are 50m away or so, it would help to speed every iteration?

Is there a way to make this faster using rtree package? I was able to find an answer that makes the script for finding intersection of polygons faster, but I can't seem to make it working for closest point to line.

Faster way of polygon intersection with shapely

https://pypi.python.org/pypi/Rtree/

Sorry if this was already answered, but I didn't find an answer here nor on gis.stackexchange

thank you for advice!

You can try the code in [this question](https://stackoverflow.com/a/45573790/6517541) to see how much speed up you get by filtering out far away links. Another approach that you can try is to get center point of linestrings(using ''shapely interpolate'), then use rtree to find candidate links (Point to point search), and then calculate distance. — Alz, Sep 12 '17 at 19:36
@boardrider thanks for the links, I edited the code. Unfortunately, it's hard to reproduce such big dataset here. — StefanK, Sep 13 '17 at 07:04
That's where the _Minimal_ in [mcve] comes into play. I looked at the code, and it still does not look like a _Verifiable example_ — boardrider, Sep 13 '17 at 17:42
Do you still need help on this? If so, we would need some information about the data. 1) How many points do you have approximately in each LineString? 2) Is it a computation that you need to do once, or you need a solution that works for new incoming points and LineStrings? What I would do is to build my own "rtree-like" data structure, but rather than build it on the (probably long) LineStrings, I would use the segments of all the LineStrings. You could also use rtree library directly by paying with some extra computation. — eguaio, Sep 19 '17 at 17:57
some good tips there! What I do now is break the street lines into smallest segments of only 2 points and then run the nested for loop with saving intermediate results into lists from which I make a dataframe at the end of the loop. Even though I need to run it for chosen area only one time, once the area becomes big (some big city), the calculation would run for 3 days at the current setting.Can you give me some hints about building my own r-tree data structure? — StefanK, Sep 20 '17 at 08:00

eguaio · Accepted Answer · 2017-09-21T11:50:10.023

Here you have a solution using rtree library. The idea is to build boxes that contain the segments in the diagonal, and use that boxes to build the rtree. This will be the most time-expensive operation. Later, you query the rtree with a box centered in the point. You get several hits that you need to check for the minimum, but the number of hits will be (hopefuly) orders of magnitud lower than checking against all the segments.

In the solutions dict you will get, for each point, the line id, the nearest segment, the nearest point (a point of the segment), and the distance to the point.

There are some comments in the code to help you. Take into account you can serialize the rtree for later use. In fact, I would recomend to build the rtree, save it, and then use it. Because the exceptions for the adjustments of the constants MIN_SIZE and INFTY will probably raise, and you would not want to lose all the computation you did building the rtree.

A too small MIN_SIZE will mean you could have errors in the solutions because if the box around the point does not intersect a segment, it could intersect a box of a segment that is not the nearest segment (it is easy to think a case).

A too big MIN_SIZE would mean to have too much false positives, that in the extreme case would make the code to try with all the segments, and you will be in the same position as before, or worst, because you are now building an rtree you don't really use.

If the data is real data from a city, I imagine you know that any address will be intersecting a segment with a distance smaller than a few blocks. That will make the search practically logaritmic.

One more comment. I'm assuming that there are not segments that are too large. Since we are using the segments as the diagonals of the boxes in the rtree, if you have some large segments in a line, this would mean a huge box will be assigned to that segment, and all addresses boxes would intersect it. To avoid this, you can always increase artificially the resolution of the LineStrins by adding more intermediate points.

import math
from rtree import index
from shapely.geometry import Polygon, LineString

INFTY = 1000000
MIN_SIZE = .8
# MIN_SIZE should be a vaule such that if you build a box centered in each 
# point with edges of size 2*MIN_SIZE, you know a priori that at least one 
# segment is intersected with the box. Otherwise, you could get an inexact 
# solution, there is an exception checking this, though.


def distance(a, b):
    return math.sqrt( (a[0]-b[0])**2 + (a[1]-b[1])**2 ) 

def get_distance(apoint, segment):
    a = apoint
    b, c = segment
    # t = <a-b, c-b>/|c-b|**2
    # because p(a) = t*(c-b)+b is the ortogonal projection of vector a 
    # over the rectline that includes the points b and c. 
    t = (a[0]-b[0])*(c[0]-b[0]) + (a[1]-b[1])*(c[1]-b[1])
    t = t / ( (c[0]-b[0])**2 + (c[1]-b[1])**2 )
    # Only if t 0 <= t <= 1 the projection is in the interior of 
    # segment b-c, and it is the point that minimize the distance 
    # (by pitagoras theorem).
    if 0 < t < 1:
        pcoords = (t*(c[0]-b[0])+b[0], t*(c[1]-b[1])+b[1])
        dmin = distance(a, pcoords)
        return pcoords, dmin
    elif t <= 0:
        return b, distance(a, b)
    elif 1 <= t:
        return c, distance(a, c)

def get_rtree(lines):
    def generate_items():
        sindx = 0
        for lid, l in lines:
            for i in xrange(len(l)-1):
                a, b = l[i]
                c, d = l[i+1]
                segment = ((a,b), (c,d))
                box = (min(a, c), min(b,d), max(a, c), max(b,d)) 
                #box = left, bottom, right, top
                yield (sindx, box, (lid, segment))
                sindx += 1
    return index.Index(generate_items())

def get_solution(idx, points):
    result = {}
    for p in points:
        pbox = (p[0]-MIN_SIZE, p[1]-MIN_SIZE, p[0]+MIN_SIZE, p[1]+MIN_SIZE)
        hits = idx.intersection(pbox, objects='raw')    
        d = INFTY
        s = None
        for h in hits: 
            nearest_p, new_d = get_distance(p, h[1])
            if d >= new_d:
                d = new_d
                s = (h[0], h[1], nearest_p, new_d)
        result[p] = s
        print s

        #some checking you could remove after you adjust the constants
        if s == None:
            raise Exception("It seems INFTY is not big enough.")

        pboxpol = ( (pbox[0], pbox[1]), (pbox[2], pbox[1]), 
                    (pbox[2], pbox[3]), (pbox[0], pbox[3]) )
        if not Polygon(pboxpol).intersects(LineString(s[1])):  
            msg = "It seems MIN_SIZE is not big enough. "
            msg += "You could get inexact solutions if remove this exception."
            raise Exception(msg)

    return result

I tested the functions with this example.

xcoords = [i*10.0/float(1000) for i in xrange(1000)]
l1 = [(x, math.sin(x)) for x in xcoords]
l2 = [(x, math.cos(x)) for x in xcoords]
points = [(i*10.0/float(50), 0.8) for i in xrange(50)]

lines = [('l1', l1), ('l2', l2)]

idx = get_rtree(lines)

solutions = get_solution(idx, points)

And got:

Thanks for putting so much work into this! I will have time to test this solution later next week, so I will mark it as a solution after. — StefanK, Sep 23 '17 at 10:12
You're welcome. I found the problem itself very interesting and with potential for many applications, so I wanted to give it a try. Looking forward for the performance results. — eguaio, Sep 23 '17 at 12:27
Come on my friend, the suspense is killing me, hahaha. You know, I love this kind of performance tests, I couldn't wait to try the new code as you are doing it. — eguaio, Sep 27 '17 at 13:33
Sorry for not responding earlier... I tried to implement this, but found it quite difficult as I would have to change a lot in my original solution - write it anew. So instead I tried filtering out points that are not in a square of 100meters from the center point. I created a column for x point, for y point and in square column that has True value if the point is in 100m square. This step drastically decreased the total computation time. Even though rtree would be faster, it was possible to use my simplified solution in practice, so we stayed with it. But good job anyway and thanks for effort — StefanK, Dec 01 '17 at 11:15
Sad to hear I will not get the performance comparison. Thank you for sharing your solution and for answering back anyway :) — eguaio, Dec 04 '17 at 12:26
By the way, thank you very much for the upvotes, it is clear to me it was you :) — eguaio, Dec 04 '17 at 16:51

score 1 · Answer 2 · answered Oct 03 '20 at 23:36

I've been searching for a solution, and I found this, which uses Geopandas. Basically, it is a straightforward approach, which considers the overlapping of the bounding boxes of points and lines. However, thanks to the spatial index, the computational cost is significantly reduced.

Find closest line to each point on big dataset, possibly using shapely and rtree

2 Answers2

Linked