
I'm doing some real estate data cleaning and ran into this novice problem, which surprisingly I can't seem to resolve on my own.

I have a dataframe with NaN values in the lat and lon columns. I can get approximately correct values by imputing the mean lat and lon of the given neighborhood.

This is an example; the actual DataFrame has more than 20k rows.

    lat   lon    neighborhood
   -34.62 -58.50 Monte Castro
   -34.63 -58.36 Boca
    nan   nan    San Telmo

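I tried to build two dictionaries with the lat and lon means for each neighborhood (the neighborhood column is named `l3` in the real data) with the following code, which actually produces two lists of one-entry dictionaries: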

neighborhood_lat = []
neighborhood_lon = []
for neighborhood in df['l3'].unique():
    lat = df[((df['l3']==neighborhood) & (df['lat'].notnull()))].mean().lat
    lon = df[((df['l3']==neighborhood) & (df['lon'].notnull()))].mean().lon
    neighborhood_lat.append({neighborhood: lat})
    neighborhood_lon.append({neighborhood: lon})

This is part of `neighborhood_lat`:

 neighborhood_lat 
 [{'Mataderos': -34.65278757721805},
 {'Saavedra': -34.551813882357166},
 {nan: nan},
 {'Boca': -34.63204552441155},
 {'Boedo': -34.62695442446412},
 {'Abasto': -34.603728937455315},
 {'Flores': -34.62757516061659},
 {'Nuñez': -34.54843158034983},
 {'Retiro': -34.595564030955934},
 {'Almagro': -34.60692879236826},
 {'Palermo': -34.58274909271148},
 {'Belgrano': -34.56304387233704},
 {'Recoleta': -34.592081482406854},
 {'Balvanera': -34.608665174550694},
 {'Caballito': -34.61749059613885}

Then I'm trying to fillna lat and lon with those dictionaries, but I can't work out how to add a condition to fillna so that it fills lat and lon with the mean lat and lon of each row's neighborhood.

Expected results

    lat                         lon                       neighborhood
   -34.62                      -58.50                     Monte Castro
   -34.63                      -58.36                     Boca
    (mean lat of neighborhood) (mean lon of neighborhood) San Telmo

Thanks for your help.

  • You want to fill the nan with the mean of each neighbourhood, right? If that's the case, increase your data so each neighbourhood is more than once in your data. – Erfan Sep 28 '19 at 17:44
  • The actual dataset contains more than 20k rows. This is an example – Matias Hermida Sep 28 '19 at 18:10
  • Possible duplicate of [Remap values in pandas column with a dict](https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict) – Erfan Sep 28 '19 at 19:00
  • Could be, but in that case they're mapping a whole column, not just the NaN values – Matias Hermida Sep 29 '19 at 02:31

1 Answer

Answering my own questions again...

I figured out the correct code to resolve the problem with the help of this answer.

code:

creating the dictionary:

neighborhood_lat = {}
neighborhood_lon = {}

# Series.mean() skips NaN by default, so no notnull() filter is needed;
# selecting the column first avoids averaging the non-numeric columns.
for neighborhood in df['l3'].dropna().unique():
    in_neighborhood = df['l3'] == neighborhood
    neighborhood_lat[neighborhood] = df.loc[in_neighborhood, 'lat'].mean()
    neighborhood_lon[neighborhood] = df.loc[in_neighborhood, 'lon'].mean()
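
As a hedged aside, the same lookup dictionaries can probably be built without the explicit loop, since `groupby` drops NaN group keys by default and `mean` skips NaN values. A minimal sketch on made-up sample data (the toy `df` below is an assumption for illustration, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the real frame has 20k+ rows.
df = pd.DataFrame({
    'lat': [-34.62, -34.64, np.nan, -34.63],
    'lon': [-58.50, -58.52, np.nan, -58.36],
    'l3': ['Monte Castro', 'Monte Castro', 'Monte Castro', 'Boca'],
})

# One mean per neighborhood; NaN coordinates are skipped by mean().
neighborhood_lat = df.groupby('l3')['lat'].mean().to_dict()
neighborhood_lon = df.groupby('l3')['lon'].mean().to_dict()
```

The resulting dictionaries are keyed by neighborhood name, so they can be passed to `map` exactly like the loop-built ones.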

filling the nan values with dictionary:

df['lat'] = df['lat'].fillna(df['l3'].map(neighborhood_lat))
df['lon'] = df['lon'].fillna(df['l3'].map(neighborhood_lon))
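
For comparison, here is a sketch of a variant that skips the dictionaries altogether: `groupby(...).transform('mean')` returns a Series aligned with the original rows, which can be passed straight to `fillna`. This is an alternative to the dictionary approach above, shown on an invented sample frame, so treat it as a starting point rather than a drop-in solution:

```python
import numpy as np
import pandas as pd

# Invented sample data; 'l3' is the neighborhood column as in the real frame.
df = pd.DataFrame({
    'lat': [-34.62, -34.64, np.nan, -34.63],
    'lon': [-58.50, -58.52, np.nan, -58.36],
    'l3': ['Monte Castro', 'Monte Castro', 'Monte Castro', 'Boca'],
})

for col in ['lat', 'lon']:
    # Per-row mean of the row's own neighborhood, index-aligned with df.
    group_mean = df.groupby('l3')[col].transform('mean')
    # Only the NaN cells are replaced; existing values are left untouched.
    df[col] = df[col].fillna(group_mean)
```

A neighborhood whose rows are all missing coordinates has no mean to fall back on, so those cells would stay NaN with either approach.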