I have the following dataframes (simplified):

    df1 = propslat    prosplong     type
           50     45       prosp1
           34      -25     prosp2


    df2 = complat     complong     type
           29      58      competitor1
           68      34      competitor2

I want to run a distance calculation between each individual prospect (740k prospects in total) and every competitor, so the output would look like the following:

    df3 = d_p(x)_to_c1         d_p(x)_to_c2      d_p(x)_to_c3
          234.34                895.34            324.5

where every row of the output is a new prospect.

My current code is the following:

    prospectsarray=[]

    prosparr = []



    for i, row in prospcords.iterrows():
        lat1 = row['prosplat']
        lon2 = row['prosplong']
        coords= [lat1,lon2]
        distancearr2 = []

        for x, row2 in compcords.iterrows():
            lat2 = row2['complat']
            lon2 = row2['complong']
            coords2 = [lat2,lon2]
            distance = geopy.distance.distance(coords, coords2).miles
            if distance > 300:
                distance = 0

            distancearr2.append(distance)
        prosparr.append(distancearr2)
    prospectsarray.extend(prosparr)
    dfprosp = pd.DataFrame(prospectsarray)

While this accomplishes my goal, it is horrendously slow.

I tried the following optimization, but the output is not iterating, and I am still using iterrows, which is what I was trying to avoid.

    competitorlist = []
    def distancecalc(df):
        distance_list = []
        for i in range(0, len(prospcords)):
            coords2 = [prospcords.iloc[i]['prosplat'],prospcords.iloc[i]['prosplong']]
            d = geopy.distance.distance(coords1,coords2).miles
            print(d)
            if d>300:
                d=0
            distance_list.append(d)
        competitorlist.append(distance_list)




    for x, row2 in compcords.iterrows():
        lat2 = row2['complat']
        lon2 = row2['complong']
        coords1 = [lat2,lon2]
        distancecalc(prospcords)
        print(distance_list)
godot
  • It's not too difficult to use apply here but first you shouldn't use a global value for coords1, I would pass it to the `distancecalc` function... – godot Dec 01 '18 at 22:59
  • Can you elaborate on what you mean here? I'm not sure I fully understand your suggestion – Victor Nogueira Dec 01 '18 at 23:15
  • You should absolutely use camel case or something similar for your compound names. Also, there is something wrong with your append-then-extend instructions. You should probably review that... – godot Dec 01 '18 at 23:55

3 Answers

1

My guess is that most of the execution time is spent in geopy.distance.distance(). You can confirm this by using cProfile or some other timing tool.
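
As a rough sketch of what such a profiling run could look like, the snippet below uses only the standard library. The names `fake_distance` and `all_distances` are hypothetical; `fake_distance` is a placeholder standing in for the geopy call so the example runs without geopy installed, and it does not compute a real distance.

```python
import cProfile
import io
import pstats


def fake_distance(a, b):
    # Hypothetical placeholder for geopy.distance.distance(a, b).miles;
    # it only burns some CPU time and is NOT a real distance.
    return sum(abs(x - y) for x, y in zip(a, b))


def all_distances(prospects, competitors):
    # One row per prospect, one column per competitor.
    return [[fake_distance(p, c) for c in competitors] for p in prospects]


prospects = [(50.0, 45.0), (34.0, -25.0)] * 500
competitors = [(29.0, 58.0), (68.0, 34.0)]

profiler = cProfile.Profile()
profiler.enable()
result = all_distances(prospects, competitors)
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

If the real distance function dominates the cumulative time column, the distance call itself is the bottleneck rather than the loop.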

According to the geopy documentation on distance, it calculates the geodesic distance between two points, using an ellipsoidal model of the Earth. It appears that this algorithm is very accurate: they compare it to a deprecated algorithm that is "only accurate to 0.2 mm". My guess is the geodesic distance is a bit time-consuming.

They also have a function great_circle (geopy.distance.great_circle) which uses a spherical model of the Earth. Because the Earth is not a true sphere, this will have "an error of up to about 0.5%." So, if the actual distance is 100 (miles/km), it could be off by as much as half a mile/km. Again, just guessing, but I suspect this algorithm is faster than the geodesic algorithm.

If you can tolerate the potential errors in your application, try using great_circle() instead of distance().

Rich Holton
  • I can tolerate the slight difference of up to 1.5 miles off. I will try this approach in about 30 minutes. I used the timeit function and found that the time increases as more values are appended to the dlist. Would I wrap the code in a cProfile function? – Victor Nogueira Dec 01 '18 at 23:17
  • This answer https://stackoverflow.com/a/582337/3119180 by Chris Lawlor gives a good overview. – Rich Holton Dec 01 '18 at 23:24
0

First of all, you should be careful about the information you're giving. The dataframes column names you give are not compatible with your code... Also a few explanations would be great about what you are trying to do.

Anyway, here is my solution:

    import pandas as pd
    from geopy import distance

    compCords = pd.DataFrame(
        {'compLat': [20.0, 13.0, 14.0], 'compLong': [-15.0, 5.0, -1.2]})
    prospCords = pd.DataFrame(
        {'prospLat': [21.0, 12.1, 13.0], 'prospLong': [-14.0, 2.2, 2.0]})


    def distanceCalc(compCoord):
        # return the list of result instead of using append() method
        propsDist = prospCords.apply(
            lambda row: distance.distance(
                compCoord, [
                    row['prospLat'], row['prospLong']]).miles, axis=1)
        # clean data in a pandas Series
        return propsDist.apply(lambda d: 0. if d > 300 else d)

    # Here too return the list through the output
    compDist = compCords.apply(lambda row: distanceCalc(
        [row['compLat'], row['compLong']]), axis=1)

    dfProsp = pd.DataFrame(compDist)

Note: when you use things like apply, you should think in a "functional" way: pass what your functions need through their inputs and outputs. Avoid tricks like appending elements to global list variables with append or extend, because those are "side effects", and side effects do not get along well with functional programming concepts like apply (or 'map', as it is usually called in functional programming).

godot
  • Euclidean distance won't work here. "The tool seems to be just calculating the Euclidean distance between the two points (the square root of the sum of the squared differences between the coordinates). This doesn't make any sense for latitudes and longitudes, which are not coordinates in a Cartesian coordinate system. Not only is this number not a meaningful distance, but it no longer contains the information required to reconstruct a distance" (from https://math.stackexchange.com/questions/29157/how-do-i-convert-the-distance-between-two-lat-long-points-into-feet-meters) – Victor Nogueira Dec 01 '18 at 23:55
  • I think you misunderstood the purpose. I just implemented a fake random distance function so I could compute a number without having to use (and import) the geopy one. You can now use this code and replace it with the distance function of your choice ;). I was just trying to answer your request about how to use apply instead of iterrows. – godot Dec 02 '18 at 00:05
  • @VictorNogueira if you could tell me how much using `apply` improve your code, I would be interested... Thx! – godot Dec 02 '18 at 14:44
  • Hi godot, I actually ended up using a threading approach rather than apply, and additionally I followed @Rich Holton's solution of using geopy.distance.great_circle rather than distance.distance. I can go back through when I have more free time on Thursday and time both approaches for a subset of calculations. Using threading and geopy.distance.great_circle improved performance dramatically: I was able to run all my calculations in less than one day, whereas previously, with nested iterrows and distance.distance, it took about 48 hours, or 2 days. – Victor Nogueira Dec 03 '18 at 15:08
  • Hi @godot, I actually ended up finding this solution to be near instantaneous! Numpy arrays are powerful! – Victor Nogueira Dec 18 '18 at 03:30
0

Here is the fastest solution I could make!

    import numpy as np
    import pandas as pd
    from geopy.distance import great_circle

    # `df` is the combined input frame: rows 0-232 are competitors, rows 234
    # onward are customers; column 0 is the id, columns 3 and 4 are lat/long.
    compuid = np.array(df.iloc[0:233, 0])
    complat = np.array(df.iloc[0:233, 3])
    complong = np.array(df.iloc[0:233, 4])
    custlat = np.array(df.iloc[234:, 3])
    custlong = np.array(df.iloc[234:, 4])

    ppmmasterlist = []
    for x, y in zip(custlat, custlong):
        # Coordinates of the current customer.
        coords1 = [x, y]
        # Per-customer collectors: distances with competitor ids, positive
        # weights with competitor ids, and the two "ppm" lists.
        distcoll = []
        listGreaterThan0 = []
        ppmlist = []
        ppmdlist = []
        for z, (p, q) in enumerate(zip(complat, complong)):
            # Coordinates of the current competitor.
            coords2 = [p, q]
            distance = great_circle(coords1, coords2).miles
            if distance >= 300:
                distance = 0
                di = 0
            else:
                # Linear weight: 1 at zero distance, falling to 0 at 300 miles.
                di = (300 - distance) / 300
                distcoll.append(distance)
                distcoll.append(compuid[z])
            if di > 0:
                listGreaterThan0.append(di)
                listGreaterThan0.append(compuid[z])
            # Competitors at index 220 and above feed the "ppm" lists.
            if z >= 220:
                ppmlist.append(di)
                ppmdlist.append(distance)
        sumval = [sum(ppmlist)]
        # Every other entry of listGreaterThan0 is a weight; sum those only.
        sumval1 = [sum(listGreaterThan0[::2])]
        mergedlist = ppmlist + sumval + ppmdlist + sumval1 + listGreaterThan0
        mergedlist.extend(distcoll)
        ppmmasterlist.append(mergedlist)

    df5 = pd.DataFrame(ppmmasterlist)
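
The loops above still make one great_circle call per prospect/competitor pair. As a sketch of how the distance matrix alone could be computed without Python-level loops, the snippet below uses NumPy broadcasting with the haversine formula. This is a swapped-in technique and not geopy's implementation: `haversine_matrix` is a hypothetical helper, and the Earth radius of 3958.8 miles is an assumed mean value, so results may differ slightly from great_circle.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (spherical model)


def haversine_matrix(lat1, lon1, lat2, lon2, cutoff=300.0):
    """Return an (n_prospects, n_competitors) array of great-circle miles.

    Distances above `cutoff` are set to 0, mirroring the rule in the question.
    """
    # Column vectors for prospects, row vectors for competitors, so that
    # broadcasting produces every prospect/competitor pair at once.
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]

    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against tiny floating-point overshoot outside [0, 1].
    dist = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    return np.where(dist > cutoff, 0.0, dist)


# Made-up sample coordinates: two prospects, three competitors.
prosp_lat = [50.0, 34.0]
prosp_lon = [45.0, -25.0]
comp_lat = [29.0, 68.0, 50.5]
comp_lon = [58.0, 34.0, 45.5]

dist = haversine_matrix(prosp_lat, prosp_lon, comp_lat, comp_lon)
print(dist.shape)
```

With 740k prospects and 233 competitors the full float64 matrix would take roughly 1.4 GB, plus intermediates, so in practice the prospects would probably need to be processed in chunks.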