2

I Have read a CSV file (that have addresses of customers) and assign the data into DataFrame table.

Description of the csv file (or the DataFrame table)

DataFrame contains several rows and 5 columns

Database example

 Address1             Address3 Post_Code   City_Name                           Full_Address
 10000009    37 RUE DE LA GARE    L-7535      MERSCH       37 RUE DE LA GARE,L-7535, MERSCH
 10000009    37 RUE DE LA GARE    L-7535      MERSCH       37 RUE DE LA GARE,L-7535, MERSCH
 10000009    37 RUE DE LA GARE    L-7535      MERSCH       37 RUE DE LA GARE,L-7535, MERSCH
 10001998  RUE EDWARD STEICHEN    L-1855  LUXEMBOURG  RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
 11000051       9 RUE DU BRILL    L-3898       FOETZ           9 RUE DU BRILL,L-3898 ,FOETZ

I have written a code (Geocode with Python) inorder to convert physical addresses to Geographic locations → Latitude and Longitude, but the code keep showing several errors

So far I have written this code :

The code is

import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Read the CSV, by the way the csv file contains 43 columns
ERP_Data = pd.read_csv("test.csv")  

# Extracting the address information into a new DataFrame
Address_info= ERP_Data[['Address1','Address3','Post_Code','City_Name']].copy()

# Adding a new column called (Full_Address) that concatenate address columns into one 
# for example   Karlaplan 13,115 20,STOCKHOLM,Stockholms län, Sweden
Address_info['Full_Address'] = Address_info[Address_info.columns[1:]].apply(
lambda x: ','.join(x.dropna().astype(str)), axis=1)

locator = Nominatim(user_agent="myGeocoder")  # holds the Geocoding service, Nominatim

# 1 - conveneint function to delay between geocoding calls
geocode = RateLimiter(locator.geocode, min_delay_seconds=1) 

# 2- create location column
Address_info['location'] = Address_info['Full_Address'].apply(geocode)

# 3 - create longitude, laatitude and altitude from location column (returns tuple)
Address_info['point'] = Address_info['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# 4 - split point column into latitude, longitude and altitude columns
Address_info[['latitude', 'longitude', 'altitude']] =   pd.DataFrame(Address_info['point'].tolist(), index=Address_info.index)

# using Folium to map out the points we created

folium_map = folium.Map(location=[49.61167,6.13], zoom_start=12,)

An example of the full output error is :

RateLimiter caught an error, retrying (0/2 tries). Called with (*('44 AVENUE JOHN FITZGERALD KENNEDY,L-1855,LUXEMBOURG',), **{}).
Traceback (most recent call last):
  File "e:\Anaconda3\lib\urllib\request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "e:\Anaconda3\lib\http\client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "e:\Anaconda3\lib\http\client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "e:\Anaconda3\lib\http\client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "e:\Anaconda3\lib\http\client.py", line 1026, in _send_output
    self.send(msg)
  File "e:\Anaconda3\lib\http\client.py", line 966, in send
    self.connect()
  File "e:\Anaconda3\lib\http\client.py", line 1414, in connect
    server_hostname=server_hostname)
  File "e:\Anaconda3\lib\ssl.py", line 423, in wrap_socket
    session=session
  File "e:\Anaconda3\lib\ssl.py", line 870, in _create
    self.do_handshake()
  File "e:\Anaconda3\lib\ssl.py", line 1139, in do_handshake
    self._sslobj.do_handshake()
socket.timeout: _ssl.c:1059: The handshake operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "e:\Anaconda3\lib\site-packages\geopy\geocoders\base.py", line 355, in _call_geocoder
    page = requester(req, timeout=timeout, **kwargs)
  File "e:\Anaconda3\lib\urllib\request.py", line 525, in open
    response = self._open(req, data)
  File "e:\Anaconda3\lib\urllib\request.py", line 543, in _open
    '_open', req)
  File "e:\Anaconda3\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "e:\Anaconda3\lib\urllib\request.py", line 1360, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "e:\Anaconda3\lib\urllib\request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error _ssl.c:1059: The handshake operation timed out>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "e:\Anaconda3\lib\site-packages\geopy\extra\rate_limiter.py", line 126, in __call__
    return self.func(*args, **kwargs)
  File "e:\Anaconda3\lib\site-packages\geopy\geocoders\osm.py", line 387, in geocode
    self._call_geocoder(url, timeout=timeout), exactly_one
  File "e:\Anaconda3\lib\site-packages\geopy\geocoders\base.py", line 378, in _call_geocoder
    raise GeocoderTimedOut('Service timed out')
geopy.exc.GeocoderTimedOut: Service timed out

Expected output is

    Address1      Address3        Post_Code   City_Name     Full_Address                      Latitude              Longitude  
    10000009    37 RUE DE LA GARE  L-7535     MERSCH       37 RUE DE LA GARE,L-7535, MERSCH  49.7508296                6.1085476                                                                            
    10000009    37 RUE DE LA GARE  L-7535     MERSCH       37 RUE DE LA GARE,L-7535, MERSCH  49.7508296                6.1085476
    10000009    37 RUE DE LA GARE  L-7535     MERSCH       37 RUE DE LA GARE,L-7535, MERSCH  49.7508296                6.1085476                                            
    10001998    RUE EDWARD STEICHEN L-1855  LUXEMBOURG   RUE EDWARD STEICHEN,L-1855,LUXEMBOURG 49.6302147              6.1713374                                        
    11000051    9 RUE DU BRILL      L-3898   FOETZ       9 RUE DU BRILL,L-3898 ,FOETZ         49.5217917               6.0101385
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Wisam hasan
  • 187
  • 1
  • 3
  • 13
  • 1
    `Nominatim` is the geocoder for OpenStreetMap. You have to use it as described in the [documentation](https://geopy.readthedocs.io/en/stable/#nominatim) or use a different [`geocoder`](https://geopy.readthedocs.io/en/stable/#module-geopy.geocoders) – Trenton McKinney Oct 17 '19 at 21:20
  • Yes, it took forever. But the full TraceBack is helpful. It shows the issues isn't how the method is being called but the throughput of the service. It's free, so you get what you pay for. It might be better to use one of the alternate `geocoders` from my first comment. I'll look into seeing how to increase the `socket.timeout` – Trenton McKinney Oct 18 '19 at 01:47

1 Answers1

3

I've updated your code:

  • Added: Address_info = Address_info.apply(lambda x: x.str.strip(), axis=1)
    • Removes whitespace before and after str
  • Added a function with try-except, to handle the lookup
from geopy.exc import GeocoderTimedOut, GeocoderQuotaExceeded
import time

ERP_Data = pd.read_csv("test.csv") 

# Extracting the address information into a new DataFrame
Address_info= ERP_Data[['Address1','Address3','Post_Code','City_Name']].copy()

# Clean existing whitespace from the ends of the strings
Address_info = Address_info.apply(lambda x: x.str.strip(), axis=1)  # ← added

# Adding a new column called (Full_Address) that concatenate address columns into one 
# for example   Karlaplan 13,115 20,STOCKHOLM,Stockholms län, Sweden
Address_info['Full_Address'] = Address_info[Address_info.columns[1:]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)

locator = Nominatim(user_agent="myGeocoder")  # holds the Geocoding service, Nominatim

# 1 - convenient function to delay between geocoding calls
# geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

def geocode_me(location):
    time.sleep(1.1)
    try:
        return locator.geocode(location)
    except (GeocoderTimedOut, GeocoderQuotaExceeded) as e:
        if GeocoderQuotaExceeded:
            print(e)
        else:
            print(f'Location not found: {e}')
            return None

# 2- create location column
Address_info['location'] = Address_info['Full_Address'].apply(lambda x: geocode_me(x))  # ← note the change here

# 3 - create longitude, latitude and altitude from location column (returns tuple)
Address_info['point'] = Address_info['location'].apply(lambda loc: tuple(loc.point) if loc else None)

# 4 - split point column into latitude, longitude and altitude columns
Address_info[['latitude', 'longitude', 'altitude']] =   pd.DataFrame(Address_info['point'].tolist(), index=Address_info.index)

Output:

 Address1                Address3 Post_Code   City_Name                             Full_Address                                                                                                                                    location                         point   latitude  longitude  altitude
 10000009       37 RUE DE LA GARE    L-7535      MERSCH          37 RUE DE LA GARE,L-7535,MERSCH                                                          (Rue de la Gare, Mersch, Canton Mersch, 7535, Lëtzebuerg, (49.7508296, 6.1085476))  (49.7508296, 6.1085476, 0.0)  49.750830   6.108548       0.0
 10000009       37 RUE DE LA GARE    L-7535      MERSCH          37 RUE DE LA GARE,L-7535,MERSCH                                                          (Rue de la Gare, Mersch, Canton Mersch, 7535, Lëtzebuerg, (49.7508296, 6.1085476))  (49.7508296, 6.1085476, 0.0)  49.750830   6.108548       0.0
 10000009       37 RUE DE LA GARE    L-7535      MERSCH          37 RUE DE LA GARE,L-7535,MERSCH                                                          (Rue de la Gare, Mersch, Canton Mersch, 7535, Lëtzebuerg, (49.7508296, 6.1085476))  (49.7508296, 6.1085476, 0.0)  49.750830   6.108548       0.0
 10001998     RUE EDWARD STEICHEN    L-1855  LUXEMBOURG    RUE EDWARD STEICHEN,L-1855,LUXEMBOURG  (Rue Edward Steichen, Grünewald, Weimershof, Neudorf-Weimershof, Luxembourg, Canton Luxembourg, 2540, Lëtzebuerg, (49.6302147, 6.1713374))  (49.6302147, 6.1713374, 0.0)  49.630215   6.171337       0.0
 11000051          9 RUE DU BRILL    L-3898       FOETZ              9 RUE DU BRILL,L-3898,FOETZ                                             (Rue du Brill, Mondercange, Canton Esch-sur-Alzette, 3898, Luxembourg, (49.5217917, 6.0101385))  (49.5217917, 6.0101385, 0.0)  49.521792   6.010139       0.0
 10000052  3 RUE DU PUITS  ROMAIN    L-8070   BERTRANGE  3 RUE DU PUITS  ROMAIN,L-8070,BERTRANGE                              (Rue du Puits Romain, Z.A. Bourmicht, Bertrange, Canton Luxembourg, 8070, Lëtzebuerg, (49.6084531, 6.0771901))  (49.6084531, 6.0771901, 0.0)  49.608453   6.077190       0.0

Note & Additional Resources:

  • The output includes the address that caused the error in your TraceBack
    • RateLimiter caught an error, retrying (0/2 tries). Called with (*('3 RUE DU PUITS ROMAIN ,L-8070 ,BERTRANGE ',)
    • Note all the extra whitespace in the address. I've added a line of code to remove whitespace from the beginning and end of the strings
  • GeocoderTimedOut, a real pain?
  • Geopy: catch timeout error

Final:

  • The final result is the service times out because of HTTP Error 429: Too Many Requests for the day.
  • Review Nominatim Usage Policy
  • Suggestion: Use a different Geocoder
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
  • Dear Trenton, again many thanks for your contribution but, unfortunately, both codes (i.e., mine and yours) doesnt solve the propblem. I guess this is because the csv file (which I have) contains several inputs hence causing the error (i.e., the csv file contains around 5000 rows, which means 10000 address requests) . I have run my code as well as your , they both working with limited amount of input (limited amount of rows or addresses). but when I increse the input both codes show error – Wisam hasan Oct 17 '19 at 23:46
  • Dear Trenton, its so kind of you to keep offering suggestions and solutions. Well I cant because data collected in real live and dont want to change it, do you have an idea that doesnt consider the amount of requests. – Wisam hasan Oct 17 '19 at 23:58
  • @Wisamhasan I've added a function to make the location call and embedded it in a try-except block. I've also run into the issue of `HTTP Error 429: Too Many Requests`. So you'll know why it's failing now. – Trenton McKinney Oct 18 '19 at 04:12
  • Dear Trenton, thanks again for your help and support. at the moment the code still showing errors. (i.e., Service timed out), and unfortunately in several cases the attached code is not converting the physical addresses to geographic locations (although the addresses are accurate). For example, this address sometimes is recogntized and somtimes are not (37 RUE DE LA GARE,L-7535,MERSCH) , another example could be (32 RUE DU LUXEMBOURG,L-7733,COLMAR-BERG) and so on. Sorry for bothering you again but the last version of your code is much better and can show some results even for the big files – Wisam hasan Oct 18 '19 at 15:05
  • @Wisamhasan Yes, the try except will show the errors, but continue to run. I included a final note at the bottom. The timeout seems to occur because of making too many requests. That has to do with the service. I couldn't find where it states the number of requests per day, but the terms of service was pretty blunt about not allowing large batch processing. Try to figure out how many requests are allowed per day and then only process that many addresses or find a service that allows higher volume of requests. – Trenton McKinney Oct 18 '19 at 15:10
  • I noticed that if you process too many requests at a time, your IP address gets blocked from making further requests. Last night I was blocked, but this morning I can make more requests. – Trenton McKinney Oct 18 '19 at 15:12
  • I see, that make sense now, thank you so much.. How about # using Folium to map out the points we created – Wisam hasan Oct 18 '19 at 15:31
  • @Wisamhasan In the data you provided me, there are almost 5000 rows, but there are only 397 unique addresses. The address that don't return any information aren't in the database. For example, if I just do `locator.geocode('12 GRAND RUE,L-1660,LUXEMBOURG')`, nothing is returned. Of the 397 unique addresses, `None` is returned for 161 of them. – Trenton McKinney Oct 18 '19 at 18:32