2

Suppose I have this df_atm:

     borough          Longitude     Latitude

0    bronx              40.79        -73.78    
1    manhattan          40.78        -73.90
2    staten island      40.84        -73.95
3    NaN                40.57        -74.11

Every row represents an ATM withdrawal.

I hope to generate value for missing value based on the coordinate inside the Longitude and Latitude columns.

     borough          Longitude     Latitude

0    bronx              40.79        -73.78    
1    manhattan          40.78        -73.90
2    staten island      40.84        -73.95
3    staten island      40.57        -74.11

Since coordinate [40.57, -74.11] are inside Staten Island's borough.

I have generated a dict with boroughs' coordinates:

borough_dict = {"Bronx" : [40.837048, -73.865433], "Brooklyn" : [40.650002, -73.949997], "Manhattan" : [40.758896, -73.985130], "Queens" : [40.742054,-73.769417], "Staten Island" : [40.579021,-74.151535]}

And this is what I try so far (code/pseudocode):

df_atm['borough'] = df_atm.apply(
lambda row: **idk what do to here** if np.isnan(row['borough']) else row['borough'],
axis=1
)

Many thanks in advance!

Aneema
  • 151
  • 11
  • 1
    How do you define this ```inside``` you have single value for both ```lon``` and ```lat``` are you looking for the closest match, or do you have somewhere ```from Lon_start to lon_end``` and ```from lat_start to lat_end```? – Grzegorz Skibinski Feb 29 '20 at 19:51
  • 1
    With a single point for each borough's coordinates it doesn't make sense to say that another point is "inside" of the borough. You *could* try a simple closest linear distance to the single point that defines a borough, but that won't be accurate. Likely you should find [shapefiles for the boroughs](https://geo.nyu.edu/catalog/nyu-2451-34154) (NYU provides these luckily) and use [GeoPandas](https://geopandas.org/) to determine whether or not those coordinates fall within the boundaries. – ALollz Feb 29 '20 at 19:51
  • @GrzegorzSkibinski no I don't have any `lon_max and lon_min ` or `lat_max and lat_min ` to delimitate a borough's polygon. But with two coordinates isn't possible to get the borough? @ALollz Wow, thank you for your strategy. It seems quite tricky for a beginner like me but I will try it. – Aneema Feb 29 '20 at 20:01

2 Answers2

3

Try this :

from math import cos, asin, sqrt
import pandas as pd

def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    return 12742 * asin(sqrt(a))

def closest(data, v):
    return min(data, key=lambda p: distance(v[0], v[1], p[0], p[1]))


df = pd.DataFrame(
    [
     {'borough': 'bronx', 'lat': 40.79, 'long': -73.78}, 
     {'borough': 'manhattan', 'lat': 40.78, 'long': -73.90},
     {'borough': None, 'lat': 40.57, 'long': -74.11}
     ],
)


borough_dict = {"Bronx" : [40.837048, -73.865433], "Brooklyn" : [40.650002, -73.949997], "Manhattan" : [40.758896, -73.985130], "Queens" : [40.742054,-73.769417], "Staten Island" : [40.579021,-74.151535]}
boroughs = [(*value, key) for key, value in borough_dict.items()]


df['borough'] = df.apply(
lambda row: closest(boroughs, [row['lat'], row['long']])[2] if row['borough'] is None else row['borough'],
axis=1
)

print(df)

Output:

         borough    lat   long
0          bronx  40.79 -73.78
1      manhattan  40.78 -73.90
2  Staten Island  40.57 -74.11

Credit to @trincot answer

adnanmuttaleb
  • 3,388
  • 1
  • 29
  • 46
  • Wow! I am absolutely super-stunned! Just a little last answer: what does it mean the `*` in `boroughs = [(*value, key) for key, value in borough_dict.items()] `? – Aneema Feb 29 '20 at 20:36
  • 1
    @Aneema I used it to unpack the list elements, e.g : `(*[40.78 -73.90], 'bronx' ) = (40.78, -73.90, 'bronx' )`, I hope it is clear. – adnanmuttaleb Feb 29 '20 at 20:39
1

You want a spatial join, so use the very closely related GeoPandas library. We'll convert your original DataFrame to a GeoDataFrame so that we can merge. Also note in your example your Latitude and Longitude columns are incorrectly labeled. I fixed that here.

import pandas as pd
import geopandas as gpd

dfg = gpd.GeoDataFrame(df.copy(), geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
#         borough   Latitude Longitude                    geometry
#0          bronx      40.79    -73.78  POINT (-73.78000 40.79000)
#1      manhattan      40.78    -73.90  POINT (-73.90000 40.78000)
#2  staten island      40.84    -73.95  POINT (-73.95000 40.84000)
#3            NaN      40.57    -74.11  POINT (-74.11000 40.57000)

# Shapefile from https://geo.nyu.edu/catalog/nyu-2451-34154 
# I downloaded the geojson
df_nys = gpd.read_file('nyu-2451-34154-geojson.json')

dfg.crs = df_nys.crs  # Set coordinate reference system to be the same

dfg = gpd.sjoin(dfg, df_nys[['geometry', 'boroname']], how='left', op='within')

         borough   Latitude Longitude                    geometry  index_right       boroname
0          bronx      40.79    -73.78  POINT (-73.78000 40.79000)          4.0         Queens
1      manhattan      40.78    -73.90  POINT (-73.90000 40.78000)          4.0         Queens
2  staten island      40.84    -73.95  POINT (-73.95000 40.84000)          NaN            NaN
3            NaN      40.57    -74.11  POINT (-74.11000 40.57000)          2.0  Staten Island

So now you could fill missing 'borough's with 'boroname'. But it does seem like a few of the other points are miss-classified. This is mostly because you don't have enough precision on your stored Latitude and Longitude. Though this would probably be the more accurate solution with better precision on Lat/Lon, I might favor the distance calculation by @adnanmuttaleb given the level of precision you have in your data.

ALollz
  • 57,915
  • 7
  • 66
  • 89
  • thank you so much for your sharing! Before your answer, I was ignoring GeoPandas library. I love your approach too! – Aneema Feb 29 '20 at 20:43
  • 1
    Yeah it can be useful with spatial data. But NYS is a very dense location so given only 2 digits on Lat/Lon it might not be the best solution here. – ALollz Feb 29 '20 at 20:44