0

df1:

id latlong_tuple
364 (17.3946820179646, 78.042644262359)
365 (17.3945761480423, 78.0427466415321)
1085 (17.3950200947952, 78.0432334533569)
1086 (17.3947638830589, 78.0430426797909)
1087 (17.3945460707558, 78.0430666916614)

df2

index latlong_tuple
01 (17.431952, 78.37396)
02 (17.48295, 78.306694)
03 (17.479734, 78.34914)
04 (17.368366, 78.38604)
05 (17.433102, 78.37506)
def tileId_mapping(sample_cord, tile_cord, tile):
    result = []
    for i in tqdm(range(0, len(sample_cord))):
        dis_list=[]
        for j in range(0, len(tile_cord)):
            dis = hs.haversine(sample_cord[i], tile_cord[j], unit=Unit.METERS)
            dis_list.append(dis)
        shortest_dis = min(dis_list)
        min_index = dis_list.index(shortest_dis)
        result.append(id_tile[min_index])
    return result

This code is too slow to when the size of df1 is 320096 and df2 is 5299669. Can someone please help me to make it faster ?

Thanks in advance.


I want the df1.id against each df2.latlong_tuple. This df1.id should be assigned based on the shortest distance from df2.latlong_tuple.

I want result something like below ,
df2
| index | latlong_tuple | Id |
|------ |-----------------------------------|----|
| 01 |(17.431952, 78.37396) |356 |

  • 1
    You might want a geopandas spatial index. This can give you overlaps / intersections quickly. https://geopandas.org/en/stable/docs/reference/api/geopandas.sindex.SpatialIndex.query.html For minimum distance in a graph see the networkx shortest path algo: https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.shortest_paths.generic.shortest_path.html#networkx.algorithms.shortest_paths.generic.shortest_path – Erotemic Feb 18 '22 at 15:41
  • haversine_vector perhaps? – Severin Pappadeux Feb 18 '22 at 15:42
  • Also, using for loops like this is very inefficient. You should be able to vectorize the calculation with numpy / pandas. Not sure if your haversine function is able to be vectorized, but I'm sure there is a vectorized implementation out there. In that case the O(N^2) approach will be faster than your current implementation. – Erotemic Feb 18 '22 at 15:43
  • Thanks Erotemic, I will try with numpy vectors and check. – Moni Priyanka Feb 19 '22 at 10:20
  • Is this the same as https://stackoverflow.com/questions/63835785/python-code-to-filter-closest-distance-pairs/63844625#63844625 ? – Willem Hendriks Feb 19 '22 at 20:42
  • Also, appending to a list is slow. If you want to do it this way, try using a deque from the collections module. It may help. – ryanskeith Feb 19 '22 at 20:45

1 Answers1

0

Maybe you can use haversine like in this answer

First of all, you should have two columns, one for your latitude and the other one for your longitute.

You could use something like this:

df1[['LAT', 'LONG']] = pd.DataFrame(df1['latlong_tuple'].to_list())

# same for the second dataframe
df2[['LAT', 'LONG']] = pd.DataFrame(df2['latlong_tuple'].to_list())

Then, compute the vectorized haversine function like so:

# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    
    return earth_radius * 2 * np.arcsin(np.sqrt(a))


df1['dist'] = haversine(df1.LAT.shift(), df1.LONG.shift(), df2.LAT.shift(), df2.LONG.shift())
Gamliel Cohen
  • 138
  • 3
  • 15
  • I want the df1.id against each df2.latlong_tuple. This df1.id should be assigned based on the shortest distance from df2.latlong_tuple. I want result something like below , df2 | index | latlong_tuple | Id | |------ |-----------------------------------|----| | 01 |(17.431952, 78.37396) |356 | – Moni Priyanka Feb 19 '22 at 10:50