Group Pandas dataframe based on highest occurring values

Question

I have a pandas dataframe with 2 columns (snippet below). I'm trying to use the City column to infer the Borough (you'll notice some Unspecified values that need to be replaced). To do this, I'm trying to show for each city the highest occurring Borough and output to a dictionary where the key would be the city and the value would be the highest occurring borough for that city.

City        Borough
Brooklyn    Brooklyn
Astoria     Queens
Astoria     Unspecified
Ridgewood   Unspecified
Ridgewood   Queens

So if Ridgewood is found to be paired with Queens 100 times, Brooklyn 4 times and Manhattan 1 time, the pair would be Ridgewood : Queens.

So far I've tried this code:

specified = data[['Borough','City']][data['Borough']!= 'Unspecified']
paired = specified.Borough.groupby(specified.City).max()

At first glance, this seemed like the correct output, but after closer inspection, the output isn't correct at all. Any ideas?

EDIT:

Tried the suggestion below:

paired = specified.groupby('City').agg(lambda x: stats.mode(x['Borough'])[0])

I noticed some of the Boroughs come out truncated as shown below:

paired.Borough.value_counts()

#[Out]# QUEENS           58
#[Out]# MANHATTAN         7
#[Out]# STATEN ISLAND     4
#[Out]# BRONX             4
#[Out]# BROOKLYN          3
#[Out]# MANHATTA          2
#[Out]# STATE             1
#[Out]# QUEEN             1
#[Out]# MANHA             1
#[Out]# BROOK             1

Of course I can just manually replace the truncated words, but I'm curious to know what the cause is?

PS - Here's the output of the DF specified FYI:

specified
#[Out]# <class 'pandas.core.frame.DataFrame'>
#[Out]# Int64Index: 719644 entries, 1 to 396225
#[Out]# Data columns:
#[Out]# Borough    719644  non-null values
#[Out]# City       651617  non-null values
#[Out]# dtypes: object(2)

specified.Borough.value_counts()
#[Out]# QUEENS           215382
#[Out]# BROOKLYN         208565
#[Out]# MANHATTAN        150016
#[Out]# BRONX             94648
#[Out]# STATEN ISLAND     51033

`max` finds the largest lexicographically. – Andy Hayden Nov 19 '12 at 13:08
ah, that would explain the bizarre results... – ChrisArmstrong Nov 19 '12 at 15:05
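
To make the point about `max` concrete, here is a minimal sketch. The rows are made up for illustration (they are not the asker's data), and the `value_counts().idxmax()` aggregation is just one possible way to pick the most frequent value, not something proposed in the thread.

```python
import pandas as pd

# Made-up sample rows for illustration; not the asker's real data.
specified = pd.DataFrame({
    "City": ["Ridgewood"] * 4 + ["Astoria"] * 3,
    "Borough": ["Queens", "Queens", "Queens", "Staten Island",
                "Queens", "Queens", "Brooklyn"],
})

by_city = specified.groupby("City")["Borough"]

# max() compares strings alphabetically, so a single "Staten Island" row
# beats three "Queens" rows for Ridgewood.
lexicographic_max = by_city.max().to_dict()

# value_counts() counts each borough within a city; idxmax() returns the
# label with the highest count.
most_frequent = by_city.agg(lambda s: s.value_counts().idxmax()).to_dict()

print(lexicographic_max)
print(most_frequent)
```

The final `.to_dict()` gives the city-to-borough dictionary the question asks for. How ties between equally frequent boroughs are resolved is not addressed here.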

score 7 · Answer 1 · answered Nov 19 '12 at 02:25

I believe this will do it:

from scipy import stats
d.groupby('City').agg(lambda x: stats.mode(x['Borough'])[0])

This gives you a DataFrame with the City as the index and the most frequent borough in the Borough column:

>>> d
         City      Borough
0    Brooklyn     Brooklyn
1     Astoria       Queens
2     Astoria       Queens
3     Astoria     Brooklyn
4     Astoria  Unspecified
5   Ridgewood  Unspecified
6   Ridgewood       Queens
7   Ridgewood       Queens
8   Ridgewood     Brooklyn
9   Ridgewood     Brooklyn
10  Ridgewood     Brooklyn
>>> d.groupby('City').agg(lambda x: stats.mode(x['Borough'])[0])
             Borough
City               
Astoria      Queens
Brooklyn   Brooklyn
Ridgewood  Brooklyn

(If you don't have SciPy installed, you'll have to write your own "mode" function, which you could do using collections.Counter. But if you're using pandas, it's a good bet you have SciPy as well.)
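
As a rough sketch of the collections.Counter route, assuming only the standard library: the helper name `most_common_value` is hypothetical, the rows are the sample frame `d` from this answer written as (City, Borough) pairs, and "Unspecified" is filtered out first, as the question does.

```python
from collections import Counter, defaultdict

# Rows of the sample frame `d` from this answer, as (City, Borough) pairs.
rows = [
    ("Brooklyn", "Brooklyn"),
    ("Astoria", "Queens"),
    ("Astoria", "Queens"),
    ("Astoria", "Brooklyn"),
    ("Astoria", "Unspecified"),
    ("Ridgewood", "Unspecified"),
    ("Ridgewood", "Queens"),
    ("Ridgewood", "Queens"),
    ("Ridgewood", "Brooklyn"),
    ("Ridgewood", "Brooklyn"),
    ("Ridgewood", "Brooklyn"),
]


def most_common_value(values):
    """Return the most frequent item in `values` (hypothetical helper)."""
    return Counter(values).most_common(1)[0][0]


# Collect the known boroughs seen for each city, skipping "Unspecified".
boroughs_by_city = defaultdict(list)
for city, borough in rows:
    if borough != "Unspecified":
        boroughs_by_city[city].append(borough)

# Map each city to its most frequent borough.
city_to_borough = {
    city: most_common_value(boroughs)
    for city, boroughs in boroughs_by_city.items()
}
print(city_to_borough)
```

With pandas, the same helper could be passed to a grouped column instead, for example `d.groupby('City')['Borough'].agg(most_common_value)`.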

answered Nov 19 '12 at 02:25

BrenBarn

this gave me an assertion error – ChrisArmstrong Nov 19 '12 at 02:32
nevermind, I did it on the 'specified' set from the other post you helped me on and it seems to have worked... – ChrisArmstrong Nov 19 '12 at 02:33
A strange thing: it seems like 'Manhattan' is getting truncated in some places:
#[Out]# LONG ISLAND CITY    QUEENS
#[Out]# MANHATTAAN          MANHATTAN
#[Out]# MANHATTAN           MANHA
#[Out]# MASPETH             QUEENS
#[Out]# MEMPHIS             QUEENS
#[Out]# MIDDLE VILLAGE      QUEENS
#[Out]# N/A                 MANHATTA
#[Out]# NEW                 BRONX
#[Out]# NEW HYDE PARK       QUEENS
#[Out]# NEW YORK            MANHA
#[Out]# NEW YORK CITY       MANHA
#[Out]# NEWYORK             MANHATTAN
– ChrisArmstrong Nov 19 '12 at 02:36
@ChrisArmstrong This seems to work fine for me... – Andy Hayden Nov 19 '12 at 15:46
perhaps it's an iPython printing issue? I'll check it again tonight... – ChrisArmstrong Nov 19 '12 at 16:25
I checked and still getting the same strange output...see edit above. – ChrisArmstrong Nov 19 '12 at 21:36
`as_index=False` option can be used when you are just looking for a dataframe with two columns. – Aman Deep Gautam Nov 27 '15 at 06:29