
I have a dataframe with the following columns:

Name, Surname, dateOfBirth, city, country

I am interested in finding the most common combination of name and surname, and how often it occurs. It would also be nice to see a list of the top 10 combinations.

My idea for top one was:

mostFreqComb = df.groupby(['Name', 'Surname'])['Name'].count().argmax()

But I think it is not giving me the correct answer. Help would be much appreciated!

Thanks, Neb

  • Note on performance, including alternatives: [Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series](https://stackoverflow.com/questions/50328246/pandas-groupby-size-vs-series-value-counts-vs-collections-counter-with-multiple) – jpp Jun 25 '18 at 14:02

2 Answers


For the performance implications of the solutions below, see [Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series](https://stackoverflow.com/questions/50328246/pandas-groupby-size-vs-series-value-counts-vs-collections-counter-with-multiple). They are presented here with the best-performing first.

GroupBy.size

You can create a series of counts with (Name, Surname) tuple indices using GroupBy.size:

res = df.groupby(['Name', 'Surname']).size().sort_values(ascending=False)

By sorting these values, we can easily extract the most common:

most_common = res.head(1)
most_common_dups = res[res == res.iloc[0]].index.tolist()  # handles duplicate top counts
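
To cover the top-10 request as well, here is a minimal, self-contained sketch with made-up data (the column names come from the question; the rows are invented purely for illustration):

import pandas as pd

# Hypothetical example data using the question's column names
df = pd.DataFrame({
    'Name':    ['Ana', 'Ana', 'Ana', 'Marko', 'Marko', 'Iva'],
    'Surname': ['Ivic', 'Ivic', 'Ivic', 'Peric', 'Peric', 'Horvat'],
})

# Count each (Name, Surname) pair and sort with the largest count first
res = df.groupby(['Name', 'Surname']).size().sort_values(ascending=False)

print(res.head(10))                # top 10 combinations (fewer if there are not 10 distinct pairs)
print(res.index[0], res.iloc[0])   # most common (Name, Surname) pair and its count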

value_counts

Another way is to construct a series of tuples, then apply pd.Series.value_counts:

res = pd.Series(list(zip(df.Name, df.Surname))).value_counts()

The result will be a series of counts indexed by Name-Surname combinations, sorted from most common to least.

name, surname = res.index[0]  # return most common
most_common_dups = res[res == res.max()].index.tolist()
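
Since value_counts already sorts in descending order, the top 10 follows directly from the same res as above, for example:

top10 = res.head(10)       # 10 most common (Name, Surname) pairs with their counts
top10 = res.nlargest(10)   # equivalent alternative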

collections.Counter

If you wish to create a dictionary of (name, surname): counts entries, you can do so via collections.Counter:

from collections import Counter

zipper = zip(df.Name, df.Surname)
c = Counter(zipper)

Counter has useful methods such as most_common, which you can use to extract your result.
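
For example, continuing with the c built above, a short sketch:

top10 = c.most_common(10)                     # list of ((name, surname), count) pairs, most frequent first
(name, surname), count = c.most_common(1)[0]  # single most common pair and its count
counts_dict = dict(c)                         # plain {(name, surname): count} dictionary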

jpp
  • Hmm, seems like you ninja-edited me there. Well, I'd recommend putting Counter at the top, it's the best part of your answer thus far. – cs95 May 12 '18 at 20:14
  • @cᴏʟᴅsᴘᴇᴇᴅ, I disagree.. All a matter of opinion :). For example, the syntax for getting duplicate top counts from a `Counter` is cumbersome [comprehension] versus `pandas` series indexing. If performance is not an issue, I'd recommend `value_counts` – jpp May 12 '18 at 21:51
  • I'm not sure I follow. It seems like OP just wants the top 10 most common values here. Why does it matter if there are duplicate counts or not? By the way, I'll note none of your options currently give OP what they asked for. – cs95 May 13 '18 at 01:07
  • @cᴏʟᴅsᴘᴇᴇᴅ, What if the 10th count is repeated? So the 10th most popular count is 100 and there are 3 name-surname combinations with 100 count? `nlargest` with series indexing is much cleaner syntactically than a comprehension. [Look here](https://stackoverflow.com/questions/26830929/collections-counter-most-common-including-equal-counts) for the IMHO ugly alternatives with `Counter`. – jpp May 13 '18 at 08:57
  • 1
    I get what you're saying may be important to some users, but I doubt that was the point of a "I want the top 10" from this OP. Really, I see your point though, Counter really should have more than one function that let's for a cleaner interaction with their API. – cs95 May 13 '18 at 11:20

Seems like a good use case for the performant Counter:

from collections import Counter
popular_names = Counter(zip(df.Name, df.Surname)).most_common(10) 
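
For reference, most_common(10) returns a list of ((Name, Surname), count) tuples, so the result can be consumed directly, for example:

for (name, surname), count in popular_names:
    print(name, surname, count)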
cs95