-1

I am a newbie in python, but I like to process data in pandas.

I have a hundred pairs of CSV data such as passenger and bus stop data. The passenger structure data is Person, and XY coordinates (UTM-Meter). The bus stop data structure is the bus stop name (BS), and XY coordinates (UTM-Meter. My data looks like this:

df(person)
Person  X           Y
A_first 260357.3199 4064458.685
A_last  261618.5243 4064789.928
B_first 258270.5926 4063717.789
B_last  258270.5926 4063717.789
C_first 259051.758  4064462.021
C_last  258270.5926 4063717.789
D_first 260764.3916 4064624.977
D_last  260753.9053 4064569.745
E_first 258270.5926 4063717.789
E_last  258270.5926 4063717.789
F_first 258262.2825 4063740.234
F_last  258270.5926 4063717.789
G_first 258326.2393 4064104.899
G_last  258270.5926 4063717.789
H_first 259491.5911 4064838.328
H_last  260747.1408 4064647.669

and,

df(bus stop)
BS  X           Y 
BS1 258280.8228 4063715.835
BS2 259961.3734 4064840.298
BS3 260360.0219 4064468.593
BS4 260530.5126 4064683.101
BS5 260806.7009 4064399.184
BS6 261178.4165 4063605.91
BS7 261222.1745 4064495.158
BS8 261627.6173 4064281.38
BS9 261662.8833 4064793.444

I want to analyze the distance matrix between person and bus stop data. My desire output is InputID (Person), TargetID (bus stop) and distance. Distance is the distance between person point data and bus stop point data. My expected result:

result
InputID (Person)    TargetID (BS)   Distance (meter)
A_first                 BS1                 xx
A_first                 BS2                 xx
A_first                 BS3                 xx
A_first                 BS4                 xx
A_first                 BS5                 xx
A_first                 BS6                 xx
A_first                 BS7                 xx
A_first                 BS8                 xx
A_first                 BS9                 xx
A_last                  BS1                 xx
A_last                  BS2                 xx
A_last                  BS3                 xx
A_last                  BS4                 xx
A_last                  BS5                 xx
A_last                  BS6                 xx
A_last                  BS7                 xx
A_last                  BS8                 xx
A_last                  BS9                 xx
............. so on

Any advice for me? thank you

Arief Hidayat
  • 937
  • 1
  • 8
  • 19

2 Answers2

1
  1. First you need to create a dataframe that is the cartestian product of your two dataframe. That means that for each person, there is a row with each bus stop, just like you wrote. You can see how to do that with Python here for example.

  2. Now, on that new dataframe, you need to compute the distance on each row between two pairs of coordinates (the person coordinates and the bus stop coordinates). For that, you can use for example GeoPandas that has a distance method.

Istopopoki
  • 1,584
  • 1
  • 13
  • 20
1

You can first create an outer join using

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'Person': ['A_first', 'A_last', 'B_first', 'B_last'],
    'X': [260357.3199, 261618.5243, 258270.5926, 258270.5926],
    'Y': [4064458.685, 4064789.928, 4063717.789, 4063717.789],
})

df2 = pd.DataFrame({
    'BS': ['BS1', 'BS2'],
    'X': [258280.8228, 259961.3734],
    'Y': [4063715.835, 4064840.298],
})


df1['key'] = 0
df2['key'] = 0

df_cartesian = df1.merge(df2, on='key').drop(columns=['key'])

Then calculate the row-wise X and Y difference, and take the L2 norm:

df_cartesian['X_diff'] = df_cartesian['X_x'] - df_cartesian['X_y']
df_cartesian['Y_diff'] = df_cartesian['Y_x'] - df_cartesian['Y_y']

df_cartesian['dist'] = np.linalg.norm(df_cartesian[['X_diff', 'Y_diff']], axis=1)

df_cartesian
#     Person          X_x          Y_x   BS          X_y          Y_y  \
# 0  A_first  260357.3199  4064458.685  BS1  258280.8228  4063715.835   
# 1  A_first  260357.3199  4064458.685  BS2  259961.3734  4064840.298   
# 2   A_last  261618.5243  4064789.928  BS1  258280.8228  4063715.835   
# 3   A_last  261618.5243  4064789.928  BS2  259961.3734  4064840.298   
# 4  B_first  258270.5926  4063717.789  BS1  258280.8228  4063715.835   
# 5  B_first  258270.5926  4063717.789  BS2  259961.3734  4064840.298   
# 6   B_last  258270.5926  4063717.789  BS1  258280.8228  4063715.835   
# 7   B_last  258270.5926  4063717.789  BS2  259961.3734  4064840.298   
# 
#       X_diff    Y_diff         dist  
# 0  2076.4971   742.850  2205.372152  
# 1   395.9465  -381.613   549.911004  
# 2  3337.7015  1074.093  3506.269681  
# 3  1657.1509   -50.370  1657.916235  
# 4   -10.2302     1.954    10.415138  
# 5 -1690.7808 -1122.509  2029.474358  
# 6   -10.2302     1.954    10.415138  
# 7 -1690.7808 -1122.509  2029.474358  
Nils Werner
  • 34,832
  • 7
  • 76
  • 98
  • works perfectly. this is very cool. I tried to combine this code result with 'distance matrix' from QGIS result. The difference is about 0.1 until 1 meter. Pardon me what is 'np.linalg.norm'? – Arief Hidayat Jun 20 '19 at 08:19
  • some infos about the precison of the distance computed that way : https://math.stackexchange.com/questions/738529/distance-between-two-points-in-utm-coordinates – Istopopoki Jun 20 '19 at 08:45