Finding the most common pair from two dataframe columns

Question

I have a dataframe of people that, among many others, has columns for the city and type of pet each person prefers. I would like to find the city where a certain pet is most prevalent.

City             Pet
New York         Cat
Washington       Dog
Boston           Dog
New York         Cat
Atlanta          Cat
New York         Dog
Atlanta          Dog
Boston           Dog

So in this case, New York would have the most cats and Boston would have the most dogs.

How do I determine which city has e.g., the most cats for my larger dataframe?

Possible duplicate of [GroupBy pandas DataFrame and select most common value](https://stackoverflow.com/questions/15222754/groupby-pandas-dataframe-and-select-most-common-value) — M_S_N, Jul 14 '19 at 20:54

fuglede · Answer 1 · 2019-07-15T07:51:10.317

If you want the full list sorted by the number of cats, you could do the following:

In [38]: df.groupby('City').Pet.value_counts().unstack().sort_values(by='Cat', ascending=False)
Out[38]:
Pet         Cat  Dog
City
New York    2.0  1.0
Atlanta     1.0  1.0
Boston      NaN  2.0
Washington  NaN  1.0

If you only want the largest one, you can use nlargest:

In [45]: df.groupby('City').Pet.value_counts().unstack().nlargest(1, 'Cat')
Out[45]:
Pet      Cat  Dog
City
New York 2.0  1.0

Or you could do the same but focus on cats from the beginning:

In [62]: df[df.Pet == 'Cat'].groupby('City').count().nlargest(1, 'Pet')
Out[62]:
         Pet
City
NewYork    2

If you don't care about the actual numbers and just want the cities, you can use idxmax:

In [51]: df.groupby('City').Pet.value_counts().unstack().idxmax()
Out[51]:
Pet
Cat   New York
Dog     Boston

If you want to do what the last example does, but only focus on cats, you can do that as well:

In [60]: df[df.Pet == 'Cat'].groupby('City').count().idxmax()
Out[60]:
Pet    New York

score 1 · Answer 2 · answered Jul 14 '19 at 20:59

Try this: nlargest will return the top 'n' results, in this case I set nlargest =1 so it returns the top result. I set animal = "cat", so you can change to "Dog" and see the results for dogs..

df2  = df[df["Pet"] ==animal].groupby('City').count().rename(columns={"Pet":animal}).nlargest(1, animal)

Here is an alternative if you want the search to case insensitive (would count "Cat, CAT or cat" )

df2  = df[df["Pet"].str.lower() ==animal.lower()].groupby('City').count().rename(columns={"Pet":animal}).nlargest(3, animal)

score 1 · Answer 3 · answered Jul 14 '19 at 23:21

1

We using mode

df.groupby('City').Pet.apply(lambda x : pd.Series.mode(x)[0])
City
Atlanta       Cat
Boston        Dog
NewYork       Cat
Washington    Dog
Name: Pet, dtype: object

answered Jul 14 '19 at 23:21

BENY

317,841
20
164
234

score 0 · Answer 4 · answered Jul 14 '19 at 22:10

I am sure there is a more pythonic way. But this should do what you need.

data = df.groupby(['pet','city']).city.count().to_frame() # getting the count of each pet at each city
data.columns = ['cnt']  # changing the name of the column
def set_max(series): # this is used to develop a new column that carries the maximum of that column
    return [max(series) for s in series]
data['maximum'] = data.transform(set_max) # add a column that carries maximum value
bm = data.apply(lambda x : x.maximum == x.cnt,axis=1)  # boolean mask that is true when the count=max
data.loc[bm].reset_index()[['pet','city']] # giving the results

Finding the most common pair from two dataframe columns

4 Answers4