3

I have a list of US ZIP codes and I have to calculate distance between all the ZIP Code Points. Its a 6k ZIPs long list, each entity has ZIP, City, State, Lat, Long, Area and Population.

So, I have to calculate distance between all the points, ie; 6000C2 combinations.

Here is a sample of my data

enter image description here

I've tried this in SAS but its too slow and inefficient, hence I'm looking for a way using Python or R.

Any leads would be appreciated.

lightyagami96
  • 336
  • 1
  • 4
  • 14
  • may be [this](https://stackoverflow.com/questions/40452759/pandas-latitude-longitude-to-distance-between-successive-rows) can help you – anky Sep 12 '18 at 10:18
  • 2
    One option, perhaps not the best, would be the use the Haversine formula with the lat/lon coordinates from a pair of ZIP codes. – Tim Biegeleisen Sep 12 '18 at 10:22
  • @TimBiegeleisen after referring to a few articles, i think Haversine formula wouldn't be truly accurate – lightyagami96 Sep 12 '18 at 10:24
  • 1
    [This](https://eurekastatistics.com/calculating-a-distance-matrix-for-geographic-points-using-r/) might help for R. What does the heavy lifting is the function [gdist](https://www.rdocumentation.org/packages/Imap/versions/1.32/topics/gdist) in the package `imap`. More generally, there are a number of [GIS packages in R](https://cran.r-project.org/web/views/Spatial.html), doubtless with various pros and cons involving computational speed, accuracy, and ease of use. – John Coleman Sep 12 '18 at 10:28
  • 1
    For more accuracy, you can use __Vincenty's algorithm__ which is based on a model of the earth as an ellipsoid. – Heaven Sep 12 '18 at 11:16
  • "I tried this in SAS..." Can you show what you tried ? What benchmarks made you determine it was slow and inefficient ? What workstation or server are you running SAS on ? Curious, what problem is requiring you to compute all distances first ? – Richard Sep 12 '18 at 12:36
  • A hash table should be pretty fast, and it's going to be a 18M row data set, so would take a few minutes. What time are you expecting? – Reeza Sep 12 '18 at 15:03
  • You could also use pre-calculated data (SAS is an option ) and do a lookup or format. Not sure how fast a format may be. updated link for most recent data. http://www.nber.org/data/census-2010-zip-code-data.html – Reeza Sep 12 '18 at 21:08

3 Answers3

9

Python Solution

If you have the corresponding latitudes and longitudes for the Zip codes, you can directly calculate the distance between them by using Haversine formula using 'mpu' library which determines the great-circle distance between two points on a sphere.

Example Code :

import mpu

zip_00501 =(40.817923,-73.045317)
zip_00544 =(40.788827,-73.039405)

dist =round(mpu.haversine_distance(zip_00501,zip_00544),2)
print(dist)

You will get the resultant distance in kms. Output:

3.27

PS. If you don't have the corresponding coordinates for the zip codes, you can get the same using 'SearchEngine' module of 'uszipcode' library (only for US zip codes)

from uszipcode import SearchEngine
#for extensive list of zipcodes, set simple_zipcode =False
search = SearchEngine(simple_zipcode=True)

zip1 = search.by_zipcode('92708')
lat1 =zip1.lat
long1 =zip1.lng

zip2 =search.by_zipcode('53404')
lat2 =zip2.lat
long2 =zip2.lng

mpu.haversine_distance((lat1,long1),(lat2,long2))

Hope this helps!!

jitesh2796
  • 164
  • 6
  • 14
  • Hi, We tried the above solution exactly with zip these codes: 00501 & 00544 but we are getting this error. [ ..\mpu\__init__.py", line 193, in haversine_distance if not (-90.0 <= lat1 <= 90): TypeError: '<=' not supported between instances of 'float' and 'NoneType' ]. Could you let me know what I'm missing? – Santana Jun 08 '20 at 04:38
  • hey, it seems like its not able to identify one of of the zip codes. Can you once try with setting "simple_zipcode" =False? Let me know if that doesn't work, will deep dive into it. – jitesh2796 Jun 09 '20 at 20:40
  • Hi Jitesh, thanks for your answer. We tried these zip codes 03060 and 03062. They worked fine with or without 'False' but when we used 03060 and 02222, the same error returned for False & True. – Santana Jun 11 '20 at 05:22
  • Hi Santana, I checked with the zip code 02222. Seems like the latitude and longitudes are not present corresponding to this zip in the uszipcode library. If you'll just print various variables, zip1, zip2 ,lat and long, etc. you'll be able to see the info available. I checked the zip code online and found the lat and long,,,and using the values it works fine. uszipcode might not have exhaustive info of all zip codes and you may search if there is any other library available. Hope this helps.... – jitesh2796 Jun 11 '20 at 07:19
4

In SAS, use the GEODIST function.

GEODIST Function

Returns the geodetic distance between two latitude and longitude coordinates.

Syntax

GEODIST(latitude-1, longitude-1, latitude-2, longitude-2 <, options>)

Richard
  • 25,390
  • 3
  • 25
  • 38
2

R solution

#sample data: first three rows of data provided
df <- data.frame( zip = c( "00501", "00544", "00601" ),
                  longitude = c( -73.045075, -73.045147, -66.750909 ),
                  latitude = c( 40.816799, 40.817225, 18.181189 ),
                  stringsAsFactors = FALSE )

library( sf ) 

#create a spatial data.frame
spdf <- st_as_sf( x = df, 
                  coords = c( "longitude", "latitude"), 
                  crs = "+proj=longlat +datum=WGS84" )

#create the distance matrix (in meters), round to 0 decimals
m <- round( st_distance( spdf ), digits = 0 )

#set row and column names of matrix
colnames( m ) <- df$zip
rownames( m ) <- df$zip

#show distance matrix in meters
m 

# Units: m
#         00501   00544   00601
# 00501       0      48 2580481
# 00544      48       0 2580528
# 00601 2580481 2580528       0
Wimpel
  • 26,031
  • 1
  • 20
  • 37