I am a beginner in pandas.
I have a DataFrame with venue ID, latitude, and longitude as columns. I need to build a separate DataFrame containing the distance between every pair of venues. There are 38333 venues, and running a 38333*38333 loop seems impractical. Can anyone suggest a better solution?
You need a loop of (38333*38332)/2 iterations. There is no other way to solve your problem. – DYZ Mar 18 '19 at 18:38
Perhaps with enough memory you could perform the cartesian product, giving you 1.5B rows, then apply the [vectorized haversine](https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas/29546836#29546836) – ALollz Mar 18 '19 at 18:44
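To put rough numbers on the two comments above, here is a small sizing sketch. It assumes each distance is stored as a 64-bit float and ignores index and ID columns, so a real DataFrame would need more memory than this.

```python
# Back-of-the-envelope sizing for all-pairs distances between n venues.
n = 38333  # number of venues, taken from the question

unique_pairs = n * (n - 1) // 2   # unordered pairs, excluding self-pairs
cartesian_rows = n * n            # full cartesian product (every ordered pair)

bytes_per_float64 = 8             # assumption: distances stored as float64
full_matrix_gb = cartesian_rows * bytes_per_float64 / 1e9
condensed_gb = unique_pairs * bytes_per_float64 / 1e9

print(f"unique pairs:   {unique_pairs:,}")
print(f"cartesian rows: {cartesian_rows:,}")
print(f"full float64 matrix:      {full_matrix_gb:.1f} GB")
print(f"condensed float64 vector: {condensed_gb:.1f} GB")
```

That is about 735 million unique pairs, and roughly 11.8 GB for the full matrix, so whether it fits depends on the machine.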
1 Answer
Here is an example of what you could do:

    import numpy as np

    def haversine_np(lon1, lat1, lon2, lat2):
        """Great-circle distance in km between points given in decimal degrees."""
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
        c = 2 * np.arcsin(np.sqrt(a))
        km = 6367 * c  # approximate Earth radius in km
        return km
    # =========== just to create random lat and long
    import pandas as pd
    from random import uniform

    def newpoint():  # returns (lon, lat)
        return uniform(-180, 180), uniform(-90, 90)

    n = 5  # choose the number of random points
    points = [newpoint() for _ in range(n)]
    lon = [x for x, y in points]
    lat = [y for x, y in points]
    ids = list(range(n))  # named ids to avoid shadowing the built-in id()
    df = pd.DataFrame({'id': ids, 'Latitude': lat, 'Longitude': lon})
    print(df)

Example output of `df`:

       id   Latitude   Longitude
    0   0  30.052750  -35.294843
    1   1  60.588742 -124.559868
    2   2 -23.872878  -21.469725
    3   3 -67.234086  -95.865194
    4   4 -26.889749 -179.668853

    def distance_ids(orig, dest):
        # dist[k][i] is the distance between id i and id (i + k) % n
        return dist[np.abs(orig - dest)][np.amin([orig, dest])]

    lat = df['Latitude'].values
    lon = df['Longitude'].values

    # if enough mem, you could calculate the distances between all points
    dist = []
    for index in range(len(lat)):
        d = haversine_np(np.roll(lon, -index), np.roll(lat, -index), lon, lat)
        # you could include the result in the dataframe:
        # row i holds the distance from id i to id (i + index) % n
        df[f'offset {index}'] = d
        # or you could append the result to a big array
        dist.append(d)

    # in this case, you could look up the distance between 2 ids
    # with the function: distance_ids(3, 4) for example
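
As a quick check of the offset-based lookup, the sketch below rebuilds `dist` for five hard-coded points and compares `distance_ids(3, 4)` with a direct call. The coordinates are hypothetical (rounded from the sample output, not real venues), `haversine_np` is repeated so the snippet runs on its own, and plain `abs` and `min` stand in for the NumPy calls as a simplification.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km (same formula and 6367 km radius as above)."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 6367 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical sample coordinates, used only for illustration.
lon = np.array([-35.29, -124.56, -21.47, -95.87, -179.67])
lat = np.array([30.05, 60.59, -23.87, -67.23, -26.89])
n = len(lat)

# np.roll(x, -k)[i] == x[(i + k) % n], so dist[k][i] is the distance
# between point i and point (i + k) % n.
dist = [haversine_np(np.roll(lon, -k), np.roll(lat, -k), lon, lat) for k in range(n)]

def distance_ids(orig, dest):
    # offset k = |orig - dest|, position i = the smaller of the two ids
    return dist[abs(orig - dest)][min(orig, dest)]

direct = haversine_np(lon[3], lat[3], lon[4], lat[4])
lookup = distance_ids(3, 4)
print(np.isclose(direct, lookup))  # expected: True
```

Note that the loop stores n arrays of length n, so every unordered pair ends up computed twice.
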
    # you could just calculate the distances between one id and all other ids
    # for id = 2 for example,
    index = 2
    lat1 = np.repeat(lat[index], len(lat))
    lon1 = np.repeat(lon[index], len(lat))
    # dist_index contains an array of all distances from id 2 to all other ids
    # (haversine_np expects longitude first, then latitude)
    dist_index = haversine_np(lon1, lat1, lon, lat)
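
If the full matrix does not fit in memory, one possible variation (not part of the code above) is to let NumPy broadcasting compute a block of rows at a time and reduce each block before moving on. The sketch below assumes a hypothetical helper `pairwise_chunks` and an arbitrary `chunk_size`; the random points are made up for the demo, and the `np.clip` guard is an addition to the formula.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    return 6367 * 2 * np.arcsin(np.sqrt(a))

def pairwise_chunks(lon, lat, chunk_size=1000):
    """Yield (start, block); block[i, j] is the distance in km
    from point start + i to point j."""
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    for start in range(0, len(lat), chunk_size):
        stop = start + chunk_size
        # A (chunk, 1) column against an (n,) row broadcasts to (chunk, n).
        block = haversine_np(lon[start:stop, None], lat[start:stop, None], lon, lat)
        yield start, block

# Hypothetical data: 2,500 random points, only to exercise the function.
rng = np.random.default_rng(0)
lon = rng.uniform(-180, 180, 2500)
lat = rng.uniform(-90, 90, 2500)

# Example use: index of the nearest other point for every point,
# without ever holding the full n x n matrix.
nearest = np.empty(len(lat), dtype=int)
for start, block in pairwise_chunks(lon, lat, chunk_size=1000):
    rows = np.arange(block.shape[0])
    block[rows, start + rows] = np.inf  # ignore each point's distance to itself
    nearest[start:start + block.shape[0]] = block.argmin(axis=1)
```

With 38,333 venues, a 1,000-row float64 block is about 0.3 GB before temporaries, so `chunk_size` would need tuning to the available memory.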

Frenchy
Thanks for this informative code; it worked great and was fast. I wanted to exploit the speed of numpy, as running loops just hung my laptop. Thanks again, I learned a lot – Sugato Mar 31 '19 at 12:58
@Sugato, if this answer helps you, please don't forget to upvote/validate the answer – Frenchy Apr 10 '19 at 11:25
@Frenchy, I upvoted the answer the first time; unfortunately I am new to this community, so due to lack of 'reputation' my upvotes are not visible. Sorry for any inconvenience, and thanks again for the solution – Sugato Apr 18 '19 at 07:06