
I have two tables, Dealers and Customers. For each customer location in the Customers table, I need to find the closest dealer from the Dealers table.

I have code that works, but it takes a few hours to run, so I need help optimizing my solution.

The Dealers table has 25k+ rows and the Customers table has 200k+ rows. Both tables have three main columns: (DealerID, Lat, Long) and (CustomerID, Lat, Long). My output looks something like this:

CustomerID  Lat    Long     ClosestDealer  Distance
Customer1   61.61  -149.58  Dealer3        15.53
Customer2   42.37  -72.52   Dealer258      8.02
Customer3   42.42  -72.1    Dealer1076     32.92
Customer4   31.59  -89.87   Dealer32       3.85
Customer5   36.75  -94.84   Dealer726      7.90

My current solution: Iterating through all the rows to find the minimum distance would take too long. To narrow the search, I grouped the data in both tables by the truncated Lat and Long values and then added those together to form a final group (see the 'LatLongGroup' column below).

CustomerID  Lat    Long     LatGroup  LongGroup  LatLongGroup
Customer1   61.61  -149.58  61        -149       -88
Customer2   42.37  -72.52   42        -72        -30
Customer3   42.42  -72.1    42        -72        -30
Customer4   31.59  -89.87   31        -89        -58
Customer5   36.75  -94.84   36        -94        -58
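For reference, the grouping above can be reproduced in a few lines of pandas (a sketch with toy rows from the table; note the groups shown come from truncation toward zero, e.g. -149.58 -> -149, rather than a true floor):

```python
import pandas as pd

# Toy rows taken from the table above
customers = pd.DataFrame({
    "CustomerID": ["Customer1", "Customer2", "Customer4"],
    "Lat": [61.61, 42.37, 31.59],
    "Long": [-149.58, -72.52, -89.87],
})

# astype(int) truncates toward zero, matching the groups in the table
customers["LatGroup"] = customers["Lat"].astype(int)
customers["LongGroup"] = customers["Long"].astype(int)
customers["LatLongGroup"] = customers["LatGroup"] + customers["LongGroup"]
```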

Both tables are sorted by the 'LatLongGroup' column, and I have a separate dataframe called group that gives the starting and ending row number of each group in the dealer table. I then match each customer against the dealer records that share its 'LatLongGroup', which narrows the search for the closest dealer.

But sometimes the closest dealer does not fall within the same group, so to avoid missing it I search not only the matching group but also the one above and the one below.

Please let me know the best way to optimize this, or whether there is an easier way to find the closest dealers for a dataset this large. Any direction is greatly appreciated. Thank you!

import pandas as pd
import geopy.distance

col_names = ["CustomerKey", "DealerKey", "Dist"]
df = pd.DataFrame(columns=col_names)
c = 0
for i in range(len(df_c)):
    # Placeholder row; DealerKey and Dist are filled in below
    row = {'CustomerKey': df_c.loc[i, 'ZIPBRANDKEY'], 'DealerKey': '', 'Dist': 0}
    df = df.append(row, ignore_index=True)
    # Locate this customer's group, then widen the window by one group
    # on each side in case the closest dealer falls just outside it
    a = group[group['LatLongGroup'] == df_c.LatLongGroup[i]].index[0]
    if a - 1 >= 0:
        start = group.loc[a - 1, 'Start']
    else:
        start = group.loc[a, 'Start']
    if a + 1 < len(group):
        end = group.loc[a + 1, 'End']
    else:
        end = group.loc[a, 'End']
    # Brute-force scan of the window for the minimum geodesic distance
    min_dist = None
    dealerkey = None
    for j in range(start, end):
        dist = round(geopy.distance.distance(df_c.Lat_long[i], df_s.Lat_long[j]).miles, 2)
        if min_dist is None or dist < min_dist:
            min_dist = dist
            dealerkey = df_s.loc[j, 'DEALER_BRAND_KEY']
    df.loc[c, 'DealerKey'] = dealerkey
    df.loc[c, 'Dist'] = min_dist
    c = c + 1
df.head()

For reference, the group dataframe mentioned above looks like this:

Group  Start  End
-138   0      7
-137   7      15
-136   15     53
-135   53     55
-88    55     78
SuperStormer
  • You can use [this approach](https://stackoverflow.com/questions/61924460/finding-similar-entries-in-python-lists/61924949#61924949) to find the nearest neighbor of each point in a dataset to a second dataset. Here, the questioner had millions of points in their dataset. – DarrylG Apr 17 '21 at 00:28
  • Hey, thank you for pointing me to that post. I have currently used the KDTree method and it works. I'll also look into the BallTree method. – user15663833 Apr 19 '21 at 20:03

2 Answers


Generate sample data:

import numpy as np
import pandas as pd

N = 25000
dealers = pd.DataFrame({"DealerID": "Dealer" + pd.RangeIndex(1, N+1).astype(str),
                        "Lat": np.random.uniform(30, 65, N),
                        "Long": np.random.uniform(-150, -70, N)}
                      ).set_index("DealerID")

N = 200000
customers = pd.DataFrame({"CustomerID": "Customer" + pd.RangeIndex(1, N+1).astype(str),
                          "Lat": np.random.uniform(30, 65, N),
                          "Long": np.random.uniform(-150, -70, N)}
                        ).set_index("CustomerID")
>>> dealers
                   Lat        Long
DealerID
Dealer1      53.923040  -96.238974
Dealer2      33.375229 -136.379545
Dealer3      30.635395 -107.639308
Dealer4      50.264205  -97.563283
Dealer5      52.366663 -130.242301
...                ...         ...
Dealer24996  62.369789 -140.430366
Dealer24997  43.079035 -126.496873
Dealer24998  43.858461  -97.471257
Dealer24999  34.433920 -135.038754
Dealer25000  61.967902  -95.496924

[25000 rows x 2 columns]

>>> customers
                      Lat        Long
CustomerID
Customer1       30.748900 -133.231319
Customer2       38.636134  -98.618844
Customer3       60.282135  -97.100096
Customer4       42.995473 -120.135218
Customer5       50.809563  -80.662491
...                   ...         ...
Customer199996  47.387618  -88.420528
Customer199997  53.618939 -124.432385
Customer199998  58.506937 -146.024708
Customer199999  48.329325 -129.149631
Customer200000  36.599969 -145.019091

[200000 rows x 2 columns]

You can use KDTree from SciPy:

from scipy.spatial import KDTree

distances, indices = KDTree(dealers).query(customers)

A few seconds later:

>>> customers.assign(ClosestDealer=dealers.iloc[indices].index, Distance=distances)
                      Lat        Long ClosestDealer  Distance
CustomerID
Customer1       30.748900 -133.231319   Dealer22102  0.189255
Customer2       38.636134  -98.618844    Dealer1510  0.282966
Customer3       60.282135  -97.100096    Dealer2715  0.182832
Customer4       42.995473 -120.135218   Dealer10539  0.423006
Customer5       50.809563  -80.662491   Dealer12022  0.091765
...                   ...         ...           ...       ...
Customer199996  47.387618  -88.420528   Dealer17124  0.325962
Customer199997  53.618939 -124.432385    Dealer9177  0.133110
Customer199998  58.506937 -146.024708   Dealer15558  0.299639
Customer199999  48.329325 -129.149631   Dealer18371  0.023172
Customer200000  36.599969 -145.019091    Dealer2316  0.199344

[200000 rows x 4 columns]
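One caveat: the Distance column here is in raw coordinate-space units (degrees), since KDTree measures straight-line distance on the lat/long values directly. If you need great-circle miles, scikit-learn's BallTree supports the haversine metric, which expects coordinates in radians and returns distances in radians. A sketch with toy data (names and values are illustrative, not from the question):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Toy stand-ins for the dealer and customer tables
dealers = pd.DataFrame({"Lat": [40.0, 50.0], "Long": [-100.0, -80.0]},
                       index=pd.Index(["DealerA", "DealerB"], name="DealerID"))
customers = pd.DataFrame({"Lat": [41.0, 49.0], "Long": [-99.0, -81.0]},
                         index=pd.Index(["Customer1", "Customer2"], name="CustomerID"))

EARTH_RADIUS_MILES = 3958.8

# haversine expects (lat, long) in radians
tree = BallTree(np.radians(dealers[["Lat", "Long"]]), metric="haversine")
dist_rad, idx = tree.query(np.radians(customers[["Lat", "Long"]]), k=1)

# Convert great-circle distance from radians to miles
result = customers.assign(
    ClosestDealer=dealers.index[idx[:, 0]],
    Distance=dist_rad[:, 0] * EARTH_RADIUS_MILES,
)
```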
Corralien

I encountered the same problem a few weeks ago, and I feel the best way is to use the k-nearest neighbors algorithm.

from sklearn.neighbors import KNeighborsClassifier

# Drop duplicates to get the unique dealer list
dealer_df = df_s[["DealerID", "LAT", "LNG"]].drop_duplicates()
dealer_df = dealer_df.set_index("DealerID")

# Instantiate with n_neighbors=1 and weights="distance"
knn = KNeighborsClassifier(n_neighbors=1, weights="distance", n_jobs=-1)
knn.fit(dealer_df.values, dealer_df.index)

df_c["Nearest Dealer"] = knn.predict(df_c[["LAT", "LNG"]].values)

I have used the same approach on nearly 1.8 million data points and it took around 5 minutes.
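If you also need the distance to the nearest dealer, the fitted model's kneighbors method returns it (in the same raw lat/long units the model was trained on, i.e. Euclidean degrees, not miles). A self-contained sketch with toy data standing in for df_s/df_c (column names assumed from the answer above):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins for the dealer and customer tables
dealer_df = pd.DataFrame({"LAT": [40.0, 50.0], "LNG": [-100.0, -80.0]},
                         index=["DealerA", "DealerB"])
df_c = pd.DataFrame({"LAT": [41.0, 49.0], "LNG": [-99.0, -81.0]})

knn = KNeighborsClassifier(n_neighbors=1, weights="distance", n_jobs=-1)
knn.fit(dealer_df.values, dealer_df.index)

df_c["Nearest Dealer"] = knn.predict(df_c[["LAT", "LNG"]].values)

# kneighbors returns (distances, indices); distances are Euclidean
# in raw coordinate units, not geodesic miles
dist, _ = knn.kneighbors(df_c[["LAT", "LNG"]].values, n_neighbors=1)
df_c["Dist"] = dist[:, 0]
```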

  • Hey, thank you for this solution. For the dataset I had, this was not completely feasible, because even if I increase 'n_neighbors' to widen the search, some of the nearest dealers are more than 200 records outside the search radius, and increasing 'n_neighbors' to a larger value would again increase the run time of the program. But this might definitely work for a differently structured dataset. – user15663833 Apr 19 '21 at 19:42