1

I have a pandas dataframe with an identifying column (Name) and three value columns (A, B, C).

df = pd.DataFrame({'Name': ['John', 'John', 'John', 'John', 'John', 'Sam', 'Sam', 'Sam'], 'A': [1, 1, 1, 2, 2, 3, 3, 3], 'B': [1, 1, 2, 2, 2, 3, 4, 3], 'C': [2, 1, 1, 1, 1, 3, 3, 3]})

For each name, I want to find the most common combination of values across the 3 value columns.

The result should be:

['John', 2, 2, 1] # as the combination 2, 2, 1 appears twice
['Sam', 3, 3, 3] # as the combination 3, 3, 3 appears twice

First time asking a question on stack overflow -> any advice on how to improve the question is welcome.

Thanks in advance.

taoufik A
  • 1,439
  • 11
  • 20
T Walker
  • 13
  • 6
  • Thank you for the link @pylang. That question relates to most common value in a single column (applied to multiple columns at once) - I am interested in most common combination of values across multiple columns. – T Walker Jan 04 '18 at 01:50
  • `df.groupby(['A','B','C'])['Name'].value_counts().nlargest(2)` – skrubber Jan 04 '18 at 02:09
  • 1
    To make @skrubber's solution more general `df.groupby(['A','B','C'])['Name'].value_counts().nlargest(len(set(df['Name'])))` – noslenkwah Jan 04 '18 at 02:24

1 Answers1

0

The method is based on the comments by @noslenkwah and @skrubber.

If you need most common row for each name as described

df = pd.DataFrame({'Name': ['John', 'John', 'John', 'John', 'John', 'Sam', 'Sam', 'Sam', 'Sam'] , 'A': [1, 1, 1, 2, 2, 3, 3, 3, 3] , 
                'B': [1, 1, 2, 2, 2, 3, 4, 3, 3] , 'C': [2, 1, 1, 1, 1, 3, 3, 3, 3]})
df_count = df.groupby(["A", "B", "C"])['Name'].value_counts().reset_index(name="Count")
l = list()
for name in set(df_count.Name):
    freq = df_count[df_count.Name == name].Count.max()
    l.append(df_count[(df_count.Name == name) & (df_count.Count == freq)].values.tolist()[0])
print(l)
[[3, 3, 3, 'Sam', 3], [2, 2, 1, 'John', 2]]

First the method group df by ["A", "B", "C"]. Then loop over names, and first calculate the most common rows' frequency within this name. Then filter out groups that have different frequncy and with different names in the query df_count[(df_count.Name == name) & (df_count.Count == freq)].

A more general problem: finding rows with highest frequency regardless names, you can try

df_count = df.groupby(["A", "B", "C"])['Name'].value_counts().reset_index(name="Count")
max_freq = df_count["Count"].max()
df_count[df_count["Count"] == max_freq]

    A   B   C   Name    Count
3   2   2   1   John    2
4   3   3   3   Sam     2

Reference

Pandas: Selecting rows based on value counts of a particular column

Selecting with complex criteria from pandas.DataFrame

Tai
  • 7,684
  • 3
  • 29
  • 49
  • Thank you for your help. This solution does work for the example case I posed, but does not work if the groups have different frequencies, e.g. for this input: df = pd.DataFrame({'Name': ['John', 'John', 'John', 'John', 'John', 'Sam', 'Sam', 'Sam', 'Sam'] , 'A': [1, 1, 1, 2, 2, 3, 3, 3, 3] , 'B': [1, 1, 2, 2, 2, 3, 4, 3, 3] , 'C': [2, 1, 1, 1, 1, 3, 3, 3, 3]}) – T Walker Jan 04 '18 at 05:05
  • What do you mean if groups have different frequencies? You mean you want not only the most frequent ones? You mean you it for each name? – Tai Jan 04 '18 at 05:11
  • I want the most frequent row for both John, and for Sam, even if one has a higher frequency than the other. – T Walker Jan 04 '18 at 05:13
  • Your edit seems to work, but overwrites itself (i.e. solution for John is replaced with solution for Sam) - working on fixing this now. – T Walker Jan 04 '18 at 05:32