4

I have a pandas dataframe with 2 columns (snippet below). I'm trying to use the City column to infer the Borough (you'll notice some Unspecified values that need to be replaced). To do this, I'm trying to show for each city the highest occurring Borough and output to a dictionary where the key would be the city and the value would be the highest occurring borough for that city.

City        Borough

Brooklyn    Brooklyn
Astoria     Queens
Astoria     Unspecified
Ridgewood   Unspecified
Ridgewood   Queens

So if Ridgewood is found to be paired with Queens 100 times, Brooklyn 4 times and Manhattan 1 time, the pair would be Ridgewood : Queens.

So far I've tried this code:

specified = data[['Borough','City']][data['Borough']!= 'Unspecified']
paired = specified.Borough.groupby(specified.City).max()

At first glance, this seemed like the correct output, but after closer inspection, the output isn't correct at all. Any ideas?

EDIT:

Tried the suggestion below: paired = specified.groupby('City').agg(lambda x: stats.mode(x['Borough'])[0])

I noticed some of the Boroughs come out truncated as shown below:

paired.Borough.value_counts()

#[Out]# QUEENS           58
#[Out]# MANHATTAN         7
#[Out]# STATEN ISLAND     4
#[Out]# BRONX             4
#[Out]# BROOKLYN          3
#[Out]# MANHATTA          2
#[Out]# STATE             1
#[Out]# QUEEN             1
#[Out]# MANHA             1
#[Out]# BROOK             1

Of course I can just manually replace the truncated words, but I'm curious to know what the cause is?

PS - Here's the output of the DF specified FYI:

specified
#[Out]# <class 'pandas.core.frame.DataFrame'>
#[Out]# Int64Index: 719644 entries, 1 to 396225
#[Out]# Data columns:
#[Out]# Borough    719644  non-null values
#[Out]# City       651617  non-null values
#[Out]# dtypes: object(2)

specified.Borough.value_counts()
#[Out]# QUEENS           215382
#[Out]# BROOKLYN         208565
#[Out]# MANHATTAN        150016
#[Out]# BRONX             94648
#[Out]# STATEN ISLAND     51033
ChrisArmstrong
  • 2,491
  • 8
  • 37
  • 60

1 Answers1

7

I believe this will do it:

from scipy import stats
d.groupby('City').agg(lambda x: stats.mode(x['Borough'])[0])

This gives you a DataFrame with the City as the index and the most frequent borough in the Borough column:

>>> d
         City      Borough
0    Brooklyn     Brooklyn
1     Astoria       Queens
2     Astoria       Queens
3     Astoria     Brooklyn
4     Astoria  Unspecified
5   Ridgewood  Unspecified
6   Ridgewood       Queens
7   Ridgewood       Queens
8   Ridgewood     Brooklyn
9   Ridgewood     Brooklyn
10  Ridgewood     Brooklyn
>>> d.groupby('City').agg(lambda x: stats.mode(x['Borough'])[0])
             Borough
City               
Astoria      Queens
Brooklyn   Brooklyn
Ridgewood  Brooklyn

(If you don't have scipy installed you'll have to make your own "mode" function, which I guess you could do using collections.Counter. But if you're using pandas it's a good bet you've got Scipy as well.)

BrenBarn
  • 242,874
  • 37
  • 412
  • 384