I am working with NYC trip data and want to map the lat-long pairs in the data to the corresponding NYC boroughs. In particular, I want to know whether any trip starts or ends at one of the NYC airports (LaGuardia/JFK).
I know that the Google Maps API and libraries like Geopy can do reverse geocoding. However, most of them return only city- and country-level results.
I want to extract the borough or airport name (e.g. Queens, Manhattan, JFK, LaGuardia) from each lat-long. I have lat-longs for both the pickup and dropoff locations.
Here is a sample of the dataset as a pandas DataFrame:
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
You can find the data here too:
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
After a bit of research, I found that I can use the Google Maps API to get county-level and even establishment-level data.
Here is the code I wrote:
import requests

# A mapper function to get the geocode data from the Google API for the lat-long passed
def reverse_geocode(latlng):
    result = {}
    url = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}'
    request = url.format(latlng)
    data = requests.get(request).json()
    # Keep only the first result, if any; otherwise return an empty dict
    if len(data['results']) > 0:
        result = data['results'][0]
    return result
# Geocode data for the pickup lat-long
trip_data_sample["est_pickup"] = [y["address_components"][0]["long_name"] for y in map(reverse_geocode, trip_data_sample["lat_long_pickup"].values)]
trip_data_sample["locality_pickup"] = [y["address_components"][2]["long_name"] for y in map(reverse_geocode, trip_data_sample["lat_long_pickup"].values)]
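The code above reads a `lat_long_pickup` column that is not part of the sample shown earlier. A minimal sketch of how such a value could be built, assuming it is a "lat,lng" string (the form the `latlng` query parameter takes); the helper name `to_latlng` is hypothetical. Note that the data lists longitude before latitude, so the order has to be swapped:

```python
def to_latlng(lat, lng):
    """Format a coordinate pair as a 'lat,lng' string (latitude first)."""
    return "{:.6f},{:.6f}".format(lat, lng)

# Pickup point of the first sample row: Pickup_latitude, Pickup_longitude
pickup = to_latlng(40.684956, -73.979485)
print(pickup)
```

On the dataframe this could be applied as `trip_data_sample["lat_long_pickup"] = [to_latlng(lat, lng) for lat, lng in zip(trip_data_sample["Pickup_latitude"], trip_data_sample["Pickup_longitude"])]`.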
However, I initially had 1.4 million records, and geocoding them was taking a very long time. So I reduced the sample to 200K, which was still too slow, and then to 115K, which also took too long.
I have now reduced it to 50K, but a sample that small is unlikely to be representative of the whole dataset.
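For a rough sense of scale, here is a back-of-the-envelope estimate of the sequential run time. The 0.2 seconds per request is an assumed figure, not something I measured; the two calls per row come from the two list comprehensions above, each of which maps `reverse_geocode` over the whole column. The function name `estimated_hours` is only for illustration.

```python
def estimated_hours(n_rows, calls_per_row=2, seconds_per_call=0.2):
    """Sequential run time in hours; seconds_per_call is an assumed latency."""
    return n_rows * calls_per_row * seconds_per_call / 3600.0

# Sample sizes tried so far
for n in (1400000, 200000, 115000, 50000):
    print(n, round(estimated_hours(n), 1))
```

Under that assumption, the full 1.4 million rows would take roughly 156 hours (about 6.5 days) for the pickup columns alone, and the dropoff locations would double that.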
Is there a better and faster way to reverse geocode lat-longs? I am not using Spark because I am running this on a local Mac, and Spark probably would not give much of a speedup on a single machine. Please advise.