4

I have a 500,000 list of latitudes and longitudes coordinates like below:

Latitude   Longitude  
42.022506  -88.168156  
41.877445  -87.723846  
29.986801  -90.166314  

I am looking to use python to get the city, state, and country for each coordinate in a new column like below:

Latitude   Longitude   City        State   Country
42.022506  -88.168156  Streamwood  IL      United States
41.877445  -87.723846  Chicago     IL      United States
29.986801  -90.166314  Metairie    LA      United States

With this large of a dataset, how can this be achieved in python? I have heard of Google's API, Nominatim's API, and Geopy package.

How do I get to run through all of the rows into this code? Right now I have to manually input the latitude and longitude into the last line.

import csv 
import pandas as pd
import numpy as np
import math
from geopy.geocoders import Nominatim

input_file = "Lat-Log.csv" # file contains ID, Latitude, Longitude
output_file = "output.csv"
df = pd.read_csv(input_file) 

geolocator = Nominatim(user_agent="geoapiExercises")
def city_state_country(coord):
    location = geolocator.reverse(coord, exactly_one=True)
    address = location.raw['address']
    city = address.get('city', '')
    state = address.get('state', '')
    country = address.get('country', '')
    return city, state, country
print(city_state_country("47.470706, -99.704723"))

The output gives me ('Bowdon', 'North Dakota', 'USA'). I am looking to replace the coordinates with my columns (latitude and longitude) to run through my list. How do I input my columns into the code to run through the whole document?

Nick
  • 207
  • 1
  • 2
  • 11
  • 3
    Is there something preventing you from doing the research on this yourself? Usually, StackOverflow questions which are just asking how to do something while having no code or attempts shown tend to get downvoted and closed. I think more research/effort on your part needs to be shown. Otherwise this is basically a "I want some code, gimme" post. See: [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/6273251) Edit: Okay, looks like you did provide some code, but please explain what part you're stuck in regarding the columns. – Random Davis Oct 01 '21 at 16:35
  • It looks like you have code that does exactly what you are asking about. What do you want the output to be instead? Rather than talking about "replacing columns", or other code-related details, it is often more constructive to talk about the output. – Code-Apprentice Oct 01 '21 at 16:43
  • If you don't want to pre-sort the coordinates to eliminate duplicates, at least cache the retrieved values so you don't call your api multiple times with the same coordinates. – RufusVS Oct 01 '21 at 16:46
  • The output gives me the city, state, and country for that exact location. I had to manually input the latitude and longitude into the `print(city_state_country("47.470706, -99.704723"))`. I was looking to replace the coordinates with my latitude and longitude columns and give me the city, state, and country for every single coordinates and put them in a new column. – Nick Oct 01 '21 at 16:46
  • Are you saying the list is currently in a file? And you want to read the list in, request the city state zip and output the lists as a 5 column table? Considering you are passing your coordinates as a single string to your function, you won't even need to parse your file. – RufusVS Oct 01 '21 at 16:51
  • Yes, I have imported the .csv into python and consists of 3 columns: ID, Latitude, Longitude. I would like to request the city, state, and country in new columns for each ID, Latitude, Longitude making a total of 6 columns. – Nick Oct 01 '21 at 16:54
  • @Nick ok, so then tell us what you specifically need help with rather than just saying in general what you want to accomplish. – Random Davis Oct 01 '21 at 16:59
  • I need help inputting the columns latitude and longitude into my code to run through every coordinate to achieve the output. – Nick Oct 01 '21 at 17:07
  • @Nick So you have imported the .csv? there is no code in your question that shows that. The answer would be there too, probably: just use the variables you stored the values in and pass them to your function, instead of the string constant in your question. – RufusVS Oct 01 '21 at 17:11
  • @RufusVS I added the code of importing and exporting the files. Do I replace the `print(city_state_country("47.470706, -99.704723"))` with the column headers in my input file? – Nick Oct 01 '21 at 17:18
  • 1
    Did you write the sample code all by yourself ? –  Oct 01 '21 at 17:22
  • So you loaded a CSV file into a pandas data frame and you want to calculate city/state/country for each row in that data frame. Is that correct? As a python programmer with little to no experience with pandas, I would write a `for` loop to do this. I think pandas has better ways, though. – Code-Apprentice Oct 01 '21 at 18:56

1 Answers1

6

You want to run a function on each row, which can be done using apply().

There are two complications, which is that you want to 1) provide multiple arguments to the function, and 2) get back multiple results.

These questions explain how to do those things:

Here's how to adapt your code to do this:

import pandas as pd
import io
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="geoapiExercises")

s = """Latitude   Longitude  
42.022506  -88.168156  
41.877445  -87.723846  
29.986801  -90.166314"""

df = pd.read_csv(io.StringIO(s), delim_whitespace=True)

def city_state_country(row):
    coord = f"{row['Latitude']}, {row['Longitude']}"
    location = geolocator.reverse(coord, exactly_one=True)
    address = location.raw['address']
    city = address.get('city', '')
    state = address.get('state', '')
    country = address.get('country', '')
    row['city'] = city
    row['state'] = state
    row['country'] = country
    return row

df = df.apply(city_state_country, axis=1)
print(df)

(I replaced your read_csv() call with an inline definition of the dataframe. Ignore that. It's not important to the example. I did that to make the example self-contained.)

The city_state_country() function gets called with every row of the dataframe. (The axis=1 argument makes apply() run using rows rather than columns.) The function gets the lat and lon, and does a query. Then, it modifies the row to include the information from the query.

This gets the following result:

    Latitude  Longitude     city      state        country
0  42.022506 -88.168156            Illinois  United States
1  41.877445 -87.723846  Chicago   Illinois  United States
2  29.986801 -90.166314           Louisiana  United States

Not the same as your example, but Nominatim doesn't seem to return a city for two your your coordinates. (It calls them towns, not cities.)

Nick ODell
  • 15,465
  • 3
  • 32
  • 66
  • 1
    You present a perfectly good pattern for this kind of problem. However, it needs to be borne in mind (for this particular case) that the geolocator services (unless paid for) are likely to stop working if you try to hit them with half a million queries in quick succession. –  Oct 01 '21 at 17:31
  • @BrutusForcus True. There are options for doing [hundreds of thousands of geocoding queries](https://wiki.openstreetmap.org/wiki/Nominatim#Alternatives_.2F_Third-party_providers), which are compatible with Nominatim. If you want to use a different provider, the domain of the provider can be specified in the constructor, as documented [here](https://geopy.readthedocs.io/en/stable/index.html?highlight=nominatim#geopy.geocoders.Nominatim.__init__) – Nick ODell Oct 01 '21 at 18:23
  • What's the best way for me to run through my whole data frame? Do I replace "geoapiExercises" with my email address? – Nick Oct 01 '21 at 18:51
  • 1
    @Nick It's not clear to me that you ought to run through the whole dataframe at once. The Nominatim [API guidelines](https://operations.osmfoundation.org/policies/nominatim/) say to only do 1 request per second. At that rate, it would take more than five days to do half a million requests. If your internet goes down or your computer crashes, you will have to restart from scratch. You should come up with a solution which can do a handful of requests, save its work, and continue, as well as pick up after a crash. – Nick ODell Oct 01 '21 at 18:54
  • 2
    @Nick I would create a separate table of positions and locations, and only request UNIQUE positions so you don't repeat requests. In fact you should probably filter locations that are within some hundreds of feet of each other, assuming they are actually within the same city and therefore needn't be requested. Then fill in your full database from that lookup database. – RufusVS Oct 01 '21 at 19:13