
I have a dataframe with the following columns:

Name, Surname, dateOfBirth, city, country

I am interested in finding the most common combination of name and surname, and how often it occurs. It would also be nice to see a list of the top 10 combinations.

My idea for top one was:

mostFreqComb = df.groupby(['Name', 'Surname'])['Name'].count().argmax()

But I think it is not giving me the correct answer. Help would be much appreciated!

Thanks, Neb

  • Note on performance, including alternatives: [Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series](https://stackoverflow.com/questions/50328246/pandas-groupby-size-vs-series-value-counts-vs-collections-counter-with-multiple) – jpp Jun 25 '18 at 14:02

2 Answers


For the performance implications of the solutions below, see [Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series](https://stackoverflow.com/questions/50328246/pandas-groupby-size-vs-series-value-counts-vs-collections-counter-with-multiple). They are presented here with the best-performing first.

GroupBy.size

You can create a series of counts with (Name, Surname) tuple indices using GroupBy.size:

res = df.groupby(['Name', 'Surname']).size().sort_values(ascending=False)

By sorting these values, we can easily extract the most common:

most_common = res.head(1)
most_common_dups = res[res == res.iloc[0]].index.tolist()  # handles duplicate top counts
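
To cover the top-10 request as well, here is a minimal, self-contained sketch with made-up data (the column names come from the question; the rows are invented purely for illustration):

import pandas as pd

# Hypothetical example data using the question's column names
df = pd.DataFrame({
    'Name':    ['Ana', 'Ana', 'Ana', 'Marko', 'Marko', 'Iva'],
    'Surname': ['Ivic', 'Ivic', 'Ivic', 'Peric', 'Peric', 'Horvat'],
})

# Count each (Name, Surname) pair and sort with the largest count first
res = df.groupby(['Name', 'Surname']).size().sort_values(ascending=False)

print(res.head(10))                # top 10 combinations (fewer if there are not 10 distinct pairs)
print(res.index[0], res.iloc[0])   # most common (Name, Surname) pair and its count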

value_counts

Another way is to construct a series of tuples, then apply pd.Series.value_counts:

res = pd.Series(list(zip(df.Name, df.Surname))).value_counts()

The result will be a series of counts indexed by Name-Surname combinations, sorted from most common to least.

name, surname = res.index[0]  # return most common
most_common_dups = res[res == res.max()].index.tolist()
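
Since value_counts already sorts in descending order, the top 10 follows directly from the same res as above, for example:

top10 = res.head(10)       # 10 most common (Name, Surname) pairs with their counts
top10 = res.nlargest(10)   # equivalent alternative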

collections.Counter

If you wish to create a dictionary of (name, surname): counts entries, you can do so via collections.Counter:

from collections import Counter

zipper = zip(df.Name, df.Surname)
c = Counter(zipper)

Counter has useful methods such as most_common, which you can use to extract your result.
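
For example, continuing with the c built above, a short sketch:

top10 = c.most_common(10)                     # list of ((name, surname), count) pairs, most frequent first
(name, surname), count = c.most_common(1)[0]  # single most common pair and its count
counts_dict = dict(c)                         # plain {(name, surname): count} dictionary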

jpp
  • Hmm, seems like you ninja-edited me there. Well, I'd recommend putting Counter at the top, it's the best part of your answer thus far. – cs95 May 12 '18 at 20:14
  • @cᴏʟᴅsᴘᴇᴇᴅ, I disagree.. All a matter of opinion :). For example, the syntax for getting duplicate top counts from a `Counter` is cumbersome [comprehension] versus `pandas` series indexing. If performance is not an issue, I'd recommend `value_counts` – jpp May 12 '18 at 21:51
  • I'm not sure I follow. It seems like OP just wants the top 10 most common values here. Why does it matter if there are duplicate counts or not? By the way, I'll note none of your options currently give OP what they asked for. – cs95 May 13 '18 at 01:07
  • @cᴏʟᴅsᴘᴇᴇᴅ, What if the 10th count is repeated? So the 10th most popular count is 100 and there are 3 name-surname combinations with 100 count? `nlargest` with series indexing is much cleaner syntactically than a comprehension. [Look here](https://stackoverflow.com/questions/26830929/collections-counter-most-common-including-equal-counts) for the IMHO ugly alternatives with `Counter`. – jpp May 13 '18 at 08:57
  • 1
    I get what you're saying may be important to some users, but I doubt that was the point of a "I want the top 10" from this OP. Really, I see your point though, Counter really should have more than one function that let's for a cleaner interaction with their API. – cs95 May 13 '18 at 11:20

Seems like a good use case for the performant Counter:

from collections import Counter
popular_names = Counter(zip(df.Name, df.Surname)).most_common(10) 
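
For reference, most_common(10) returns a list of ((Name, Surname), count) tuples, so the result can be consumed directly, for example:

for (name, surname), count in popular_names:
    print(name, surname, count)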
cs95