I have a pandas dataframe with 2 columns (snippet below). I'm trying to use the City column to infer the Borough (you'll notice some Unspecified values that need to be replaced). To do this, I'm trying to show for each city the highest occurring Borough and output to a dictionary where the key would be the city and the value would be the highest occurring borough for that city.
City Borough
Brooklyn Brooklyn
Astoria Queens
Astoria Unspecified
Ridgewood Unspecified
Ridgewood Queens
So if Ridgewood is found to be paired with Queens 100 times, Brooklyn 4 times and Manhattan 1 time, the pair would be Ridgewood : Queens.
So far I've tried this code:
specified = data[['Borough','City']][data['Borough']!= 'Unspecified']
paired = specified.Borough.groupby(specified.City).max()
At first glance, this seemed like the correct output, but after closer inspection, the output isn't correct at all. Any ideas?
EDIT:
Tried the suggestion below: paired = specified.groupby('City').agg(lambda x: stats.mode(x['Borough'])[0])
I noticed some of the Boroughs come out truncated as shown below:
paired.Borough.value_counts()
#[Out]# QUEENS 58
#[Out]# MANHATTAN 7
#[Out]# STATEN ISLAND 4
#[Out]# BRONX 4
#[Out]# BROOKLYN 3
#[Out]# MANHATTA 2
#[Out]# STATE 1
#[Out]# QUEEN 1
#[Out]# MANHA 1
#[Out]# BROOK 1
Of course I can just manually replace the truncated words, but I'm curious to know what the cause is?
PS - Here's the output of the DF specified FYI:
specified
#[Out]# <class 'pandas.core.frame.DataFrame'>
#[Out]# Int64Index: 719644 entries, 1 to 396225
#[Out]# Data columns:
#[Out]# Borough 719644 non-null values
#[Out]# City 651617 non-null values
#[Out]# dtypes: object(2)
specified.Borough.value_counts()
#[Out]# QUEENS 215382
#[Out]# BROOKLYN 208565
#[Out]# MANHATTAN 150016
#[Out]# BRONX 94648
#[Out]# STATEN ISLAND 51033