0

I have a pandas dataframe with the following relevant features: location (city), lat, long.

I want to add to this pandas dataframe the following columns in the most efficient manner:

  1. Distance to closest hospital
  2. Distance to closest train station
  3. Etc

My dataframe looks like this:

Column1 postal_code city_name   type_of_property    price   number_of_rooms house_area  fully_equipped_kitchen  open_fire   terrace garden  surface_of_the_land number_of_facades   swimming_pool   state_of_the_building   lattitude   longitude   province    region
13380   13380   1785    Brussegem   1   235000  1   71  1   0   1   0   0   4   0   good    4.265887    50.927771   Brabant flamand Flandre
21135   21135   8630    Bulskamp    1   545000  3   191 1   0   0   0   0   2   0   as new  2.643449    51.044461   Flandre-Occidentale Flandre
5002    5002    4287    Lincent 0   97500   3   90  0   0   0   0   260 4   0   to renovate 5.031817    50.711613   Liège   Wallonie
9544    9544    8400    Oostende    0   130000  3   119 1   0   0   1   71  2   0   to renovate 2.920327    51.230318   Flandre-Occidentale Flandre
38923   38923   4830    Limbourg    0   149000  2   140 1   0   0   0   15  2   0   to be done up   5.940299    50.612276   Liège   Walloni

I found this python package which allows me to find places near by:

https://github.com/slimkrazy/python-google-places

So I created 2 function:

  1. To calculate distance between 2 points (geodesic distance)

     import geopy
     from googleplaces import GooglePlaces, types, lang, ranking
     import geopy.distance
    
     def geodesicdistance(start,end):
         return print(geopy.distance.geodesic(start, end).km)
    
  2. Function to get nearby places to a start point (hospitals, train stations, etc) It all depends on the type parameter

     def getplace(start, location,  type):
             YOUR_API_KEY = ''
             google_places = GooglePlaces(YOUR_API_KEY)
    
             query_result = google_places.nearby_search(
                 lat_lng=start,
                 location=location,
                 radius=500,
                 types=[type],
                 rankby=ranking.DISTANCE 
                 )
    
             return (query_result.places[0].geo_location['lng'], query_result.places[0].geo_location['lat'])
    

the type is an enum with the following relevant values for me:

TYPE_BUS_STATION, TYPE_HOSPITAL, TYPE_AIRPORT, TYPE_BAKERY,TYPE_CITY_HALL, TYPE_FIRE_STATION, TYPE_PHARMACY, TYPE_POLICE,TYPE_SUBWAY_STATION,TYPE_TRAIN_STATION,TYPE_UNIVERSITY

11 types.

Basically I want to create 11 new dynamic columns as: Distance to closes bus station, Distance to closes hospital, Distance to closest airport, etc, etc.

The part I am missing and I dont have a clue how to do it, its how to iterate on the pandas dataframe and per each row, call the functions 11 times and store it as a new value on the corresponding column

Update: I tried the code below but still got an error. (I changed location by city_name)

import geopy
from googleplaces import GooglePlaces, types, lang, ranking
import geopy.distance
   
# Corrected geodesicdistance so it returns correct result rather than None
def geodesicdistance(start,end):
    result = geopy.distance.geodesic(start, end).km
    #print(result)                    # uncommment to print
    return result 

# Refactored getplace to getplaces to provide distance to items in type_list
def getplaces(start, 
              location, 
              types):  
    '''
        Get distance to nearby item based for types
    '''
    
    YOUR_API_KEY = ''
    google_places = GooglePlaces('')

    # Compute the closest distance for each type
    start_lat_lng = (start['lat'], start['lng'])
    distances = []

    for type_ in types:
        # Find closest place for type_
        query_result = google_places.nearby_search(
                         lat_lng = start,            # A dict containing the following keys: lat, lng (default None)
                         location = location,        # A human readable location, e.g 'London, England' (default None)
                         radius = 500, 
                         types = [type_],
                         rankby = ranking.DISTANCE 
                         )
        # 0th element hast lat/lng for closest since we ranked by distance
        end_lat_lng = query_result.places[0].geo_location['lat'], query_result.places[0].geo_location['lng']
        
        # Add distance to closest to list by type
        distances.append(geodesicdistance(start_lat_lng , end_lat_lng ))
        
    return distances

def min_dist(lattitude, longitude, location, types):
    '''
        Get distances for places closest to latitude/longitude

       latitude/longitude   - position to find nearby items from
       location             - human-readable name of the position
       types                - list of places to find by type
    '''
    start = {'lat': lattitude,
             'lng': longitude}
    
    return getplaces(start, location, types)


# List of 11 types used from googleplaces.types
types_list = [types.TYPE_BUS_STATION, types.TYPE_HOSPITAL, types.TYPE_AIRPORT, types.TYPE_BAKERY,
                       types.TYPE_CITY_HALL, types.TYPE_FIRE_STATION, types.TYPE_PHARMACY, 
                       types.TYPE_POLICE,types.TYPE_SUBWAY_STATION, types.TYPE_TRAIN_STATION, types.TYPE_UNIVERSITY]

# Create Dataframe whose columns have the distances to closest item by type
closest_df = df.apply(lambda row: pd.Series(min_dist(row['lattitude'], 
                                                     row['longitude'], 
                                                     row['city_name'],
                                                     types_list), 
                                             index = types_list), 
                      axis = 'columns',
                      result_type ='expand')

# Concatenate the two dfs columnwise
final_df = pd.concat([df, closest_df], axis = 1)

but I have this error:

IndexError                                Traceback (most recent call last)
/home/azureuser/cloudfiles/code/Users/levm38/LightGBM/Tests/featureengineering.ipynb Cell 2 in <cell line: 63>()
     58 types_list = [types.TYPE_BUS_STATION, types.TYPE_HOSPITAL, types.TYPE_AIRPORT, types.TYPE_BAKERY,
     59                        types.TYPE_CITY_HALL, types.TYPE_FIRE_STATION, types.TYPE_PHARMACY, 
     60                        types.TYPE_POLICE,types.TYPE_SUBWAY_STATION, types.TYPE_TRAIN_STATION, types.TYPE_UNIVERSITY]
     62 # Create Dataframe whose columns have the distances to closest item by type
---> 63 closest_df = df.apply(lambda row: pd.Series(min_dist(row['lattitude'], 
     64                                                      row['longitude'], 
     65                                                      row['city_name'],
     66                                                      types_list), 
     67                                              index = types_list), 
     68                       axis = 'columns',
     69                       result_type ='expand')
     71 # Concatenate the two dfs columnwise
     72 final_df = pd.concat([df, closest_df], axis = 1)

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/frame.py:8845, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   8834 from pandas.core.apply import frame_apply
   8836 op = frame_apply(
   8837     self,
   8838     func=func,
   (...)
   8843     kwargs=kwargs,
   8844 )
...
---> 36 end_lat_lng = query_result.places[0].geo_location['lat'], query_result.places[0].geo_location['lng']
     38 # Add distance to closest to list by type
     39 distances.append(geodesicdistance(start_lat_lng , end_lat_lng ))

IndexError: list index out of range

Update 2:

The code runs but I always get NaNs, even if I increase the ratious to 5000 mt instead of 500, I always get NaNs in all columns

The only thing I did is change the location by city_name, I hardcoded ,+"Belgium" on the location parameter to the cityname.

See dataset here: https://raw.githubusercontent.com/Joffreybvn/real-estate-data-analysis/master/data/clean/belgium_real_estate.csv

Updated Code:

import geopy
from googleplaces import GooglePlaces, types, lang, ranking
import geopy.distance

import numpy as np
import pandas as pd

# Corrected geodesicdistance so it returns correct result rather than None
def geodesicdistance(start,end):
    result = geopy.distance.geodesic(start, end).km
    #print(result)                    # uncommment to print
    return result 

# Refactored getplace to getplaces to provide distance to items in type_list
def getplaces(start, 
              location, 
              types):  
    '''
        Get distance to nearby item based for types
    '''
    
    YOUR_API_KEY = ''
    google_places = GooglePlaces(YOUR_API_KEY)

    # Compute the closest distance for each type
    start_lat_lng = (start['lat'], start['lng'])
    distances = []

    for type_ in types:
        # Find closest place for type_
        query_result = google_places.nearby_search(
                         lat_lng = start,            # A dict containing the following keys: lat, lng (default None)
                         location = location +",Belgium",        # A human readable location, e.g 'London, England' (default None)
                         radius = 5000, 
                         types = [type_],
                         rankby = ranking.DISTANCE 
                         )
        
        # 9/1/2022 --added try/except block to handle 0 responses
        try:                  
            # 0th element hast lat/lng for closest since we ranked by distance
            end_lat_lng = query_result.places[0].geo_location['lat'], query_result.places[0].geo_location['lng']

            # Add distance to closest to list by type
            distances.append(geodesicdistance(start_lat_lng , end_lat_lng ))
            
        except IndexError:
            distances.append(np.nan)             # did not return a value
        
    return distances

def min_dist(latitude, longitude, location, types):
    '''
        Get distances for places closest to latitude/longitude

       latitude/longitude   - position to find nearby items from
       location             - human-readable name of the position
       types                - list of places to find by type
    '''
    start = {'lat': latitude,                                # spelling correction 9/1/2022
             'lng': longitude}
    
    return getplaces(start, location, types)

dfsmall= df.sample(10)

types_list = [types.TYPE_BUS_STATION, types.TYPE_HOSPITAL, types.TYPE_AIRPORT, types.TYPE_BAKERY,
                       types.TYPE_CITY_HALL, types.TYPE_FIRE_STATION, types.TYPE_PHARMACY, 
                       types.TYPE_POLICE,types.TYPE_SUBWAY_STATION, types.TYPE_TRAIN_STATION, types.TYPE_UNIVERSITY]

# Create Dataframe whose columns have the distances to closest item by type
closest_df = dfsmall.apply(lambda row: pd.Series(min_dist(row['lattitude'], 
                                                     row['longitude'], 
                                                     row['city_name'],
                                                     types_list), 
                                             index = types_list), 
                      axis = 'columns',
                      result_type ='expand')

# Concatenate the two dfs columnwise
final_df = pd.concat([dfsmall, closest_df], axis = 1)
print('Final df')
display(final_df)
Luis Valencia
  • 32,619
  • 93
  • 286
  • 506
  • Unrelated to solving your problem, but in function geodesicdistance you have `return print(geopy.distance.geodesic(start, end).km)` which will always return None. You should first, set the calculation to a local variable, second, print the local variable, and then third. return the value of the local variable. – DarrylG Aug 26 '22 at 11:17
  • how about `df[new_col]=df['some_col'].apply(your_function)`? – Rafaó Aug 26 '22 at 11:29

1 Answers1

1

Using pandas.DataFrame.apply with result_type = 'expand' to create columns by distance.

Code

import geopy
from googleplaces import GooglePlaces, types, lang, ranking
import geopy.distance

import numpy as np
import pandas as pd

# Corrected geodesicdistance so it returns correct result rather than None
def geodesicdistance(start,end):
    result = geopy.distance.geodesic(start, end).km
    #print(result)                    # uncommment to print
    return result 

# Refactored getplace to getplaces to provide distance to items in type_list
def getplaces(start, 
              location, 
              types):  
    '''
        Get distance to nearby item based for types
    '''
    
    YOUR_API_KEY = 'YOUR API KEY'
    google_places = GooglePlaces(YOUR_API_KEY)

    # Compute the closest distance for each type
    start_lat_lng = (start['lat'], start['lng'])
    distances = []

    for type_ in types:
        # Find closest place for type_
        query_result = google_places.nearby_search(
                         lat_lng = start,            # A dict containing the following keys: lat, lng (default None)
                         location = location,        # A human readable location, e.g 'London, England' (default None)
                         radius = 500, 
                         types = [type_],
                         rankby = ranking.DISTANCE 
                         )
        
        # 9/1/2022 --added try/except block to handle 0 responses
        try:                  
            # 0th element hast lat/lng for closest since we ranked by distance
            end_lat_lng = query_result.places[0].geo_location['lat'], query_result.places[0].geo_location['lng']

            # Add distance to closest to list by type
            distances.append(geodesicdistance(start_lat_lng , end_lat_lng ))
            
        except IndexError:
            distances.append(np.nan)             # did not return a value
        
    return distances

def min_dist(latitude, longitude, location, types):
    '''
        Get distances for places closest to latitude/longitude

       latitude/longitude   - position to find nearby items from
       location             - human-readable name of the position
       types                - list of places to find by type
    '''
    start = {'lat': latitude,                                # spelling correction 9/1/2022
             'lng': longitude}
    
    return getplaces(start, location, types)

Example Usage

# Create Dataframe using central areas of three cities in the USA
s = '''location latitude    longitude
NYC 40.754101   -73.992081
Chicago 41.882702   -87.619392
Atlanta 33.753746   -84.386330'''

df = pd.read_csv(StringIO(s), sep = '\t')
print('Initial df')
display(df)

# List of 11 types used from googleplaces.types
types_list = [types.TYPE_BUS_STATION, types.TYPE_HOSPITAL, types.TYPE_AIRPORT, types.TYPE_BAKERY,
                       types.TYPE_CITY_HALL, types.TYPE_FIRE_STATION, types.TYPE_PHARMACY, 
                       types.TYPE_POLICE,types.TYPE_SUBWAY_STATION, types.TYPE_TRAIN_STATION, types.TYPE_UNIVERSITY]

# Create Dataframe whose columns have the distances to closest item by type
closest_df = df.apply(lambda row: pd.Series(min_dist(row['latitude'], 
                                                     row['longitude'], 
                                                     row['location'],
                                                     types_list), 
                                             index = types_list), 
                      axis = 'columns',
                      result_type ='expand')

# Concatenate the two dfs columnwise
final_df = pd.concat([df, closest_df], axis = 1)
print('Final df')
display(final_df)

Output

initial df

    location    latitude    longitude
0   NYC 40.754101   -73.992081
1   Chicago 41.882702   -87.619392
2   Atlanta 33.753746   -84.386330

final_df

    location    latitude    longitude   bus_station hospital    airport bakery  city_hall   fire_station    pharmacy    police  subway_station  train_station   university
0   NYC 40.754101   -73.992081  0.239516    0.141911    0.196990    0.033483    NaN 0.181210    0.106216    0.248619    0.229708    0.407709    0.035780
1   Chicago 41.882702   -87.619392  0.288502    0.442175    0.957081    0.327077    1.024382    0.467242    0.249753    0.648701    0.565269    0.478530    0.424755
2   Atlanta 33.753746   -84.386330  0.374402    0.097586    0.424375    0.232727    0.548718    0.549474    0.286725    0.250114    0.366779    0.386960    0.029468
DarrylG
  • 16,732
  • 2
  • 17
  • 23
  • please see update, a little error – Luis Valencia Sep 01 '22 at 12:57
  • I don't have a Google API key. Can you check the input to google_places.nearby_search and the output query_result? – DarrylG Sep 01 '22 at 14:34
  • @LuisValencia I'm back. Feel free to delete the comment. I'll take a look now. – DarrylG Sep 01 '22 at 21:55
  • @LuisValencia check updated answer that uses a simplified dataframe with location, latitude, longitude. You should be able to use the same code with your dataframe. – DarrylG Sep 01 '22 at 23:00
  • thanks a lot for your hardwork, but I am still having issues, I always get NaNs, I added an Update 2 to the question with all details. including the dataset. – Luis Valencia Sep 02 '22 at 08:10
  • @LuisValencia I looked up three of your coordinates from your url for your CSV data on google map i.e. [Discover coordinates or search by latitude & longitude](https://support.google.com/maps/answer/18539?hl=en&co=GENIE.Platform%3DDesktop). namely 1) Liège (5.573420, 50.645138), 2) Deurne (4.469525, 51.211528), 3) Warquignies (3.822352, 50.400264). All three were off the coast of Somalia in the Arabian Sea, so there is nothing nearby. Are you sure about the latitudes/longitudes? – DarrylG Sep 02 '22 at 11:29
  • @LuisValencia found two issues (worked after correcting them): 1) latitude/longitude columns are reversed and 2) latitude has one not two t's i.e. *latitude*. – DarrylG Sep 02 '22 at 11:40
  • 1
    @LuisValencia one fix to last comment is to rename the columns: `df = pd.read_csv(url); df.rename({'longitude':'lattitude', 'latitude':'longitude'}, inplace = True);` This swaps the names and corrects the spelling of latitude. – DarrylG Sep 02 '22 at 11:45
  • amazing help, worked like a charm! thanks a lot – Luis Valencia Sep 07 '22 at 10:02
  • @LuisValencia you're welcome. By the way, rather than the fix in my last comment, it would be better to correct the original CSV file if you are the creator of it. – DarrylG Sep 07 '22 at 10:16
  • I am not, but I can download it and modify it instead of doing the renaming, thx – Luis Valencia Sep 07 '22 at 10:23
  • 1
    @LuisValencia if you have to process other files from this source and expect to have to do similar data cleaning then I would use the software fix. If you will just be using this file only, then I would correct it manually. – DarrylG Sep 07 '22 at 10:29