0

I am required to identify the Postal/Zip code for each record in a dataset which contains location information in the following format hosted in an SQL server DB,

Location dataset

For identifying Postal/Zip codes for a location I'm using the information provided by Geo Names which is in the format Country code, Postal code, Place, data1, data2, data3, data4, latitude, longitude and accuracy as follows in a text file,

Zip codes and their lat lon location

How can I identify the zip code for each record in the first dataset (checking whether a locations lat lang falls within the lat lang of a zip code). I have attempted to identify any previous problems related to this topic but I could only find this question where the answer does not elaborate on the procedure or calculation required to arrive to the answer (only provides a language specific solution).

Ps. This is a requirement that arose for my undergrad final year project.

Rukshan Hassim
  • 505
  • 6
  • 15
  • 1
    For each record you need to find the *nearest* postcode, but since you haven't told us what your database server is, we can't help with that part. There should be plenty of questions that have solved this already though. – DavidG Feb 20 '19 at 16:56
  • Where are you stuck ? https://stackoverflow.com/help/mcve – Soufiane Tahiri Feb 20 '19 at 16:56
  • @DavidG apologies for not listing the technologies. I have updated the question now. – Rukshan Hassim Feb 20 '19 at 17:04
  • SQL Server will do all this for you then. – DavidG Feb 20 '19 at 17:07
  • @SoufianeTahiri Extremely sorry for any confusions. I have to identify the zip codes for each record in the first dataset. I acquired a text file with zip code and lat lang information but I'm unsure on how to proceed to check whether a record falls within the lat lang of a zip code. – Rukshan Hassim Feb 20 '19 at 17:08
  • Note that an area of a ZIP code usually comes in all form of irregular and weird shapes. If you can identify and determine wheter a certain (lat,long) point is within a ZIP area based on a single point representing this ZIP area, then you will have to accept errors, because distance from this point alone is not sufficent to determine accurately whether you are still in this ZIP area or not... –  Feb 20 '19 at 17:21
  • Agree 110% that zip code boundaries are irregular. Please correct me if I am mistaken, but if that is the GeoNames data you are showing in the second example, it appears that the database author is attempting to define a "center" or other significant point for the particular zip code by the lat/long they give? Is that not the point you'd shoot for when defining the closest point? For example, 40.79, -74.02 IS the center of zip 04047 (North Bergan) for the purposes of this database? – markaaronky Feb 20 '19 at 17:26
  • You can't get an exact zip code but you can get best. The task is really comparing a Point with a List of Points and finding closest match. – jdweng Feb 20 '19 at 17:42

1 Answers1

1

One method that will work, though it's a bit of brute force, is to apply the Haversine Formula to the the pair of (Lat,Lon) in your initial dataset (your first picture) to each of (Lat,Lon) pairs in your GeoNames database.

The Haversine formula gives you the great circle distance (remember, the Earth is a globe) between two lat/lon pairs. If you take your first example row, lat 40.769, lon -73.9545, and cycle through each of the records in your GeoNames database calculating the great circle distance (for North Bergan, Weehawken, Union City, etc.), keeping track of the closest distance you've found, that should give you a good indicator of the correct zip code.

markaaronky
  • 1,231
  • 12
  • 29
  • Good point with regard to Haversine, but in this application scenario here (ZIP codes for a location), the earth surface can be assumed flat, as an area of a ZIP code shouldn't be that large for the curvature of the earth surface to become a significant factor. Also, since ZIP code areas are usually of rather irregular shape, a certain fuzziness/error cannot be avoided anyway when just one point location per ZIP code area is given –  Feb 20 '19 at 17:10
  • A valid point on assumption of flatness elgonzo. I offer Haversine not just because usefulness anywhere (flat or globe), but for its ability to easily calculate distance given lat/lon pairs. True, another function could admittedly be substituted. But I think the approach would be sound. Your point about the fuzziness of zip code boundaries is a good one. Here I think we are at the mercy of whoever threw the dart at the GeoNames map and decided that given point defines the "center" (?) of the given zip code :-) – markaaronky Feb 20 '19 at 17:20
  • 1
    I was not trying to discredit your answer and claiming that Haversine is not a sound approach. Sorry, if my comment could be interpreted that way, but it was not my intention. Of course using Haversine is a sound approach. I was just trying to point out that within the limits of the given usage scenario here, a simplified approach might also be feasible. Again, my apologies if this came across wrong... –  Feb 20 '19 at 17:24
  • I didn't take it that way at all elgonzo!!! You are completely correct and it adds value to the discussion to point out that when you aren't considering great distances, you really don't need to resort to Haversine to get accurate answers! Kudos to you for enhancing the discussion! – markaaronky Feb 20 '19 at 17:29
  • 1
    I tried finding other methods but ended up using this method as its very straightforward. – Rukshan Hassim Feb 25 '19 at 18:31