I am a beginner in pandas. I have a DataFrame with a venue ID and its latitude and longitude as columns. I need to build a separate DataFrame that holds the distance between each pair of venues. There are 38333 venues, and running a 38333*38333 loop seems impractical. Can anyone suggest a better solution? (Image: dataframe snapshot)

DYZ
Sugato
  • You need a loop of (38333*38332)/2 iterations. There is no other way to solve your problem. – DYZ Mar 18 '19 at 18:38
  • Perhaps with enough memory you could perform the cartesian product, giving you 1.5B rows, then implement the [vectorized haversine](https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas/29546836#29546836) – ALollz Mar 18 '19 at 18:44

1 Answer

Here is an example of what you could do:

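import numpy as np
import pandas as pd


def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in decimal degrees.

    Accepts scalars or NumPy arrays (combined element-wise).
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c  # 6367 km is the approximate Earth radius used here
    return km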

# =========== just to create random longitudes and latitudes
from random import uniform


def newpoint():  # returns (longitude, latitude)
    return uniform(-180, 180), uniform(-90, 90)


n = 5  # choose the number of random points
points = [newpoint() for _ in range(n)]
lon = [x for x, y in points]
lat = [y for x, y in points]
ids = list(range(n))  # named ids to avoid shadowing the built-in id()
df = pd.DataFrame({'id': ids, 'Latitude': lat, 'Longitude': lon})
print(df)

Example output of df:

   id   Latitude   Longitude
0   0  30.052750  -35.294843
1   1  60.588742 -124.559868
2   2 -23.872878  -21.469725
3   3 -67.234086  -95.865194
4   4 -26.889749 -179.668853

def distance_ids(orig, dest):
    # dist[k][i] holds the distance between id i and id (i + k) % n,
    # so look up the offset |orig - dest| at the smaller of the two ids
    return dist[np.abs(orig - dest)][np.amin([orig, dest])]


lat = df['Latitude'].values
lon = df['Longitude'].values

# if there is enough memory, you could calculate the distances between all points
dist = []
for index in range(len(lat)):
    # d[i] is the distance between id i and id (i + index) % n
    d = haversine_np(np.roll(lon, -index), np.roll(lat, -index), lon, lat)
    # you could include the result in the dataframe
    # (row i of this column is the distance from id i to id (i + index) % n)
    df[f'offset {index}'] = pd.Series(d)
    # or you could append the result to a big list
    dist.append(d)
    # in this case, you could look up the distance between 2 ids
    # with the function: distance_ids(3, 4) for example

# you could also calculate just the distances between one id and all the other ids,
# for id = 2 for example
index = 2
lat1 = np.repeat(lat[index], len(lat))
lon1 = np.repeat(lon[index], len(lon))
# dist_index contains an array of all distances from id 2 to all other ids
# (haversine_np expects longitudes first, then latitudes)
dist_index = haversine_np(lon1, lat1, lon, lat)
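
The one-id-against-all calculation can also be extended to many ids at once with NumPy broadcasting, which is one way to read the "vectorized haversine" suggestion from the comments. The following is only a sketch under stated assumptions: `pairwise_chunks`, `chunk_size` and `radius_km` are names made up for this illustration, and the sample points are random. A full float64 matrix for 38333 venues would need roughly 38333² × 8 bytes, about 11.8 GB, so the sketch yields one block of rows at a time instead of building everything in one go.

```python
import numpy as np


def haversine_np(lon1, lat1, lon2, lat2, radius_km=6367.0):
    """Great-circle distance in km; inputs in decimal degrees, broadcastable."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    # clip guards against rounding that pushes a slightly above 1
    return radius_km * 2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def pairwise_chunks(lon, lat, chunk_size=1000):
    """Yield (start, block), where block[i, j] is the distance in km
    from point start + i to point j (hypothetical helper)."""
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    for start in range(0, len(lon), chunk_size):
        stop = start + chunk_size
        # a (chunk, 1) column against an (n,) row broadcasts to (chunk, n)
        block = haversine_np(lon[start:stop, None], lat[start:stop, None], lon, lat)
        yield start, block


# Assumed sample data: 5 random points, similar to the example above
rng = np.random.default_rng(0)
lon = rng.uniform(-180, 180, size=5)
lat = rng.uniform(-90, 90, size=5)

# For a small n the whole matrix fits in memory, so the blocks can be stacked
full = np.vstack([block for _, block in pairwise_chunks(lon, lat, chunk_size=2)])
print(full.shape)  # (5, 5)
```

With the real data, each block could be reduced (for example to the nearest venues per row) or written to disk before the next one is computed, so that the complete 38333 × 38333 matrix never has to sit in memory.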
Frenchy
  • is it okay for you? – Frenchy Mar 25 '19 at 12:50
  • Thanks for this informative code, it worked out great and was also fast. I wanted to exploit the speed of numpy, as running loops just hung my laptop. Thanks again, I learned a lot – Sugato Mar 31 '19 at 12:58
  • @Sugato, if this answer helps you, please don't forget to upvote/validate the answer – Frenchy Apr 10 '19 at 11:25
  • @Frenchy, I upvoted the answer the first time; unfortunately I am new to this community, so due to lack of 'reputation' my upvotes are not visible. Sorry for any inconvenience, and thanks again for the solution – Sugato Apr 18 '19 at 07:06
  • No problem. So you can't validate if you can't upvote? – Frenchy Apr 18 '19 at 07:10