Given a raw string input such as:

1600 Divisadero St
San Francisco, CA 94115
b/t Post St & Sutter St 
Lower Pacific Heights

I want to extract:

City: San Francisco
State: California or CA
Country: USA

I'll be parsing millions of addresses, so using a paid API is not feasible.

I'm planning to use a named entity recognizer, but I'm unable to find enough training data to cover, ideally, any location.

Is there an open-source project out there that I could use?

wolfgang
  • Your input doesn't contain country but your output does, is that a mistake? or would you expect the program to look up the country based on city and state input – user3636636 Jul 16 '15 at 11:02
  • You'll need to provide more examples. Are all of the addresses in different formats, or can you always e.g. extract the second line to get the city and state? – Lynn Jul 16 '15 at 11:05
  • See also [here](http://stackoverflow.com/questions/3845006/database-of-countries-and-their-cities). – Lynn Jul 16 '15 at 11:06
  • @Mauris The addresses are in different formats. Regarding your link, I like the geonames database for all the world's cities and countries. I guess that once you can extract the city name from the address string, you can pinpoint the state and country – wolfgang Jul 16 '15 at 11:43
  • @Mauris Any ideas on how I might optimize the extraction of the city from the address string? The brute-force option is to look up every word in the `cites` database and learn more about it – wolfgang Jul 16 '15 at 11:44
  • See [this](https://github.com/datamade/usaddress) – mbatchkarov Jul 16 '15 at 11:52
  • @mbatchkarov awesome! – wolfgang Jul 16 '15 at 11:53
  • "I'll be parsing millions of addresses and using a Paid API is not feasible..." That's understandable. On the off chance that your budget loosens up SmartyStreets offers services that could process millions of records in the course of a few hours. The results would be standardized *and* verified. There's an unlimited pricing option that would keep your cost fixed. It's really fast (millions per hour) and geo-distributed. I know because I'm a developer there and I work on the services I'm recommending. https://smartystreets.com/docs/address – Michael Whatcott Aug 05 '15 at 20:58
  • I dared to ask the same question and found an answer (at least for me): https://stackoverflow.com/a/66140761/1668622 – frans Feb 10 '21 at 16:23

2 Answers

OpenStreetMap's geocoder, Nominatim, can be downloaded and set up on your own machine. This is an extremely tedious and time-consuming process: you will need 500 GB of free disk space and on the order of tens of days for the indexing. At the end of it, though, you will have a full-fledged geocoder on your own machine that should be able to handle your current needs and many future ones.
If you go down this route, I recommend first trying out their example web APIs to see whether the quality is acceptable.
It is also well worth considering spending the money on the Google or Bing geocoder instead.
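
To make the "try the web API first" step concrete, here is a minimal sketch of how a query against a Nominatim instance might look. The base URL and both helper names (`build_search_url`, `extract_place`) are assumptions for illustration; the `/search` endpoint with `q`, `format` and `addressdetails` is Nominatim's documented search API, and the `city`/`town`/`village` fallback is only a guess at which address key a given result will carry.

```python
from urllib.parse import urlencode

# Assumption: a self-hosted Nominatim instance is reachable at this address.
BASE_URL = "http://localhost:8080"


def build_search_url(address, base_url=BASE_URL):
    """Build a /search URL that asks for JSON output with address parts."""
    # Collapse newlines and repeated spaces into a single free-form query.
    query = " ".join(address.split())
    params = {"q": query, "format": "jsonv2", "addressdetails": 1, "limit": 1}
    return f"{base_url}/search?{urlencode(params)}"


def extract_place(results):
    """Pick city, state and country out of a decoded search response."""
    if not results:
        return None  # the geocoder found nothing for this query
    addr = results[0].get("address", {})
    # Smaller places are reported under "town" or "village" instead of "city".
    city = addr.get("city") or addr.get("town") or addr.get("village")
    return {
        "city": city,
        "state": addr.get("state"),
        "country": addr.get("country"),
        "country_code": addr.get("country_code"),
    }
```

The response can then be fetched with something like `json.load(urllib.request.urlopen(build_search_url(raw)))` and passed to `extract_place`. Free-form lines such as "b/t Post St & Sutter St" may confuse the geocoder, so it is probably worth checking a sample of real addresses before committing to the full install.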

Aditya Mukherji
  • The best suggestion so far :) – wolfgang Jul 16 '15 at 20:43
  • do you know how rome2rio does it? – alvas Jul 16 '15 at 23:46
  • @wolfgang I haven't ever done this myself so don't know if this is a good idea. – Aditya Mukherji Jul 17 '15 at 07:05
  • @alvas didn't know about this site until just now.. but looks like they use OSM with a ton of custom magic on top - http://blog.rome2rio.com/2015/03/02/political-and-geographic-complexity-in-multi-modal-search/ .. if geocoding is a core part of their business, it makes sense to invest that much effort into this – Aditya Mukherji Jul 17 '15 at 07:08

@adi92's answer is the best choice here, but it requires a very beefy machine with many cores and a lot of RAM to index the entire database. For those who need something computationally lighter, www.geonames.org is comprehensive enough if you only need city, state and country.
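
As a rough sketch of the GeoNames route (not something I have run at scale), the idea is to load one of the tab-separated city dumps into a dictionary and then test word n-grams from the address against it, longest first, instead of querying a database for every single word. The column positions follow the layout described in the GeoNames dump readme; the function names and the `max_words` parameter are made up for illustration.

```python
import re

# Column positions in the tab-separated GeoNames dump (e.g. cities15000.txt).
NAME, ASCIINAME, COUNTRY_CODE, ADMIN1_CODE = 1, 2, 8, 10


def load_city_index(lines):
    """Map lowercased city names to (name, admin1 code, country code)."""
    index = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= ADMIN1_CODE:
            continue  # skip malformed or truncated rows
        place = (cols[NAME], cols[ADMIN1_CODE], cols[COUNTRY_CODE])
        for key in {cols[NAME].lower(), cols[ASCIINAME].lower()}:
            # First row wins; ambiguous names need extra disambiguation.
            index.setdefault(key, place)
    return index


def find_city(address, index, max_words=3):
    """Return the first known city found in the address, longest match first."""
    # Keep alphabetic words only, so house numbers and ZIP codes are ignored.
    words = re.findall(r"[^\W\d_]+(?:['.-][^\W\d_]+)*", address)
    for size in range(max_words, 0, -1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size]).lower()
            if candidate in index:
                return index[candidate]
    return None
```

With the dump downloaded, usage would be along the lines of `index = load_city_index(open("cities15000.txt", encoding="utf-8"))` followed by `find_city(raw_address, index)`. For US rows the admin1 code should be the state abbreviation, so the state and country fall out of the same lookup. Names shared by several cities are not disambiguated here; a state or postal code found elsewhere in the string would be the obvious tie-breaker.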

wolfgang