pandas - Merge nearly duplicate rows based on column value

Question

I have a pandas dataframe with several rows that are near duplicates of each other, except for one value. My goal is to merge or "coalesce" these rows into a single row, without summing the numerical values.

Here is an example of what I'm working with:

Name   Sid   Use_Case  Revenue
A      xx01  Voice     $10.00
A      xx01  SMS       $10.00
B      xx02  Voice     $5.00
C      xx03  Voice     $15.00
C      xx03  SMS       $15.00
C      xx03  Video     $15.00

And here is what I would like:

Name   Sid   Use_Case            Revenue
A      xx01  Voice, SMS          $10.00
B      xx02  Voice               $5.00
C      xx03  Voice, SMS, Video   $15.00

I don't want to sum the "Revenue" column because my table is the result of a pivot over several time periods, where "Revenue" simply ends up listed multiple times instead of having a different value per "Use_Case".

What would be the best way to tackle this issue? I've looked into the groupby() function but I still don't understand it very well.

score 99 · Accepted Answer · edited Jun 02 '18 at 05:46

You can use groupby with agg, taking 'first' for Sid and Revenue and using ', '.join as a custom function for Use_Case:

df = df.groupby('Name').agg({'Sid':'first', 
                             'Use_Case': ', '.join, 
                             'Revenue':'first' }).reset_index()

# change column order
print(df[['Name','Sid','Use_Case','Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00

A nice idea from the comments (thanks, Goyo) is to group by every column except Use_Case:

df = df.groupby(['Name','Sid','Revenue'])['Use_Case'].apply(', '.join).reset_index()

# change column order
print(df[['Name','Sid','Use_Case','Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00
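
For anyone who wants to run both variants end to end, here is a minimal sketch that rebuilds the question's table first. The DataFrame construction and the names `by_name` and `by_all` are my own additions for illustration; Revenue is kept as strings, as in the question.

```python
import pandas as pd

# Sample data copied from the question (Revenue kept as text, e.g. "$10.00").
df = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "C", "C"],
    "Sid": ["xx01", "xx01", "xx02", "xx03", "xx03", "xx03"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice", "SMS", "Video"],
    "Revenue": ["$10.00", "$10.00", "$5.00", "$15.00", "$15.00", "$15.00"],
})

# Variant 1: group by Name only and say how to collapse every other column.
by_name = (
    df.groupby("Name")
      .agg({"Sid": "first", "Use_Case": ", ".join, "Revenue": "first"})
      .reset_index()
)

# Variant 2: group by every column that is constant within a group,
# then join the one column that varies.
by_all = (
    df.groupby(["Name", "Sid", "Revenue"])["Use_Case"]
      .apply(", ".join)
      .reset_index()
)
by_all = by_all[["Name", "Sid", "Use_Case", "Revenue"]]  # restore column order

print(by_name)
print(by_all)
```

The second variant is a little more defensive: if two rows share a Name but differ in Sid or Revenue, they stay separate instead of silently keeping only the first value.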

edited Jun 02 '18 at 05:46 by leoschet
answered Mar 28 '16 at 21:29 by jezrael

I would group by everything except `'Use_Case'`, just in case. Also the aggregate function can be just `', '.join`, no need to use `lambda`.. – Stop harming Monica Mar 28 '16 at 21:41
Turns out this breaks if your column has values that `join` doesn't like. I had to throw a `.map(str)` in before the `apply` for it to work cleanly. – Eric Ed Lohmar Jul 13 '17 at 18:14
Yes, or use `.astype(str)`, which casts the values to strings. – jezrael Jul 13 '17 at 18:21
@jezrael when attempting your solution the following error code is received: "Cannot access callable attribute 'astype' of 'SeriesGroupBy' objects, try using the 'apply' method". Do you know what would cause this? – MaxB Mar 11 '19 at 19:01
@jezrael how to join only unique values of 'Use_Case': ', '.join, – panda Mar 20 '19 at 13:21
@panda - change `', '.join` to `lambda x: ', '.join(set(x))` – jezrael Mar 20 '19 at 13:22
@jezrael Please check the chat box – panda Mar 20 '19 at 13:32
N.b. this works, but you need to remove all nulls first. – Isaac Mar 03 '20 at 13:04
@jezrael can you please explain to me what `'first'` in the agg function does? `.agg({'Sid':'first', 'Use_Case': ', '.join, 'Revenue':'first' })` – Nemra Khalil Nov 10 '20 at 13:36
@NemraKhalil - it means take the first value in each group. – jezrael Nov 10 '20 at 13:37
@NemraKhalil - The reason is that columns without an aggregation are omitted. – jezrael Nov 10 '20 at 13:40
@jezrael Can you pls tell me how to create sum the "Revenue" column? – user1862965 Apr 18 '21 at 21:56
@user1862965 - change `'Revenue':'first'` to `'Revenue':'sum'` – jezrael Apr 19 '21 at 04:18
@jezrael thanks a lot, what would be the MySQL equivalent of this pandas code? df.groupby('Name').agg({'Sid':'first', 'Use_Case': ', '.join, 'Revenue':'sum' }).reset_index() – user1862965 Apr 19 '21 at 07:03
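
Several of the comments above (non-string values, nulls, unique values only) can be handled in one small aggregation function. This is only a sketch: the helper name `join_unique` and the sample data are hypothetical, not from the thread. Note that `.astype(str)` is a method of the column (or of each group's Series), not of the `SeriesGroupBy` object, which appears to be what the error quoted above is complaining about.

```python
import pandas as pd

# Hypothetical data: a repeated value, a non-string value and a null.
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "Use_Case": ["Voice", "SMS", "Voice", 7, None],
})

def join_unique(values):
    """Join the non-null values of one group as strings, without repeats."""
    # Drop nulls before casting, so they do not end up as text in the result.
    cleaned = values.dropna().astype(str)
    # dict.fromkeys removes repeats while keeping first-seen order
    # (a plain set() would not guarantee any particular order).
    return ", ".join(dict.fromkeys(cleaned))

out = df.groupby("Name")["Use_Case"].agg(join_unique).reset_index()
print(out)
```

With this data, group A collapses to "Voice, SMS" and group B to "7".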

score 23 · Answer 2 · answered Mar 28 '16 at 21:31

You can groupby and apply the list function:

>>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
  Name   Sid Revenue                    0
0    A  xx01  $10.00         [Voice, SMS]
1    B  xx02   $5.00              [Voice]
2    C  xx03  $15.00  [Voice, SMS, Video]

(In case you are concerned about duplicates, use set instead of list.)
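
To make the list-versus-set remark concrete, here is a small sketch with one deliberately repeated row. The sample data and the names `as_lists` and `as_unique` are assumptions for illustration; I sort the set so the output is deterministic, which the answer itself does not require.

```python
import pandas as pd

# Hypothetical data: "Voice" appears twice for A.
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B"],
    "Sid": ["xx01", "xx01", "xx01", "xx02"],
    "Revenue": ["$10.00", "$10.00", "$10.00", "$5.00"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice"],
})

keys = ["Name", "Sid", "Revenue"]

# Keep every value, repeats included, in original row order.
as_lists = df.groupby(keys)["Use_Case"].apply(list).reset_index()

# Drop repeats; sorted() turns the unordered set into a stable list.
as_unique = (
    df.groupby(keys)["Use_Case"]
      .apply(lambda s: sorted(set(s)))
      .reset_index()
)

print(as_lists)
print(as_unique)
```

Selecting the column from the grouped DataFrame, as done here, also keeps the name Use_Case on the result column.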

answered Mar 28 '16 at 21:31 by Ami Tavory

Can't thank you enough for this answer here! – seizethedata Apr 28 '21 at 11:47
This is the sexiest solution, in my opinion :) – mrGott Nov 10 '21 at 19:48

Eric Ed Lohmar · Answer 3 · 2017-07-13T19:10:54.777

I was using some code that I didn't think was optimal and eventually found jezrael's answer. But after using it and running a timeit test, I actually went back to what I was doing, which was:

cmnts = {}
for i, row in df.iterrows():
    while True:
        try:
            if row['Use_Case']:
                cmnts[row['Name']].append(row['Use_Case'])

            else:
                cmnts[row['Name']].append('n/a')

            break

        except KeyError:
            cmnts[row['Name']] = []

df.drop_duplicates('Name', inplace=True)
df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]

According to my 100 run timeit test, the iterate and replace method is an order of magnitude faster than the groupby method.

import pandas as pd
from my_stuff import time_something

df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                   'b': [i for i in range(1, 10001)]})

runs = 100

interim_dict = 'txt = {}\n' \
               'for i, row in df.iterrows():\n' \
               '    try:\n' \
               "        txt[row['a']].append(row['b'])\n\n" \
               '    except KeyError:\n' \
               "        txt[row['a']] = []\n" \
               "df.drop_duplicates('a', inplace=True)\n" \
               "df['b'] = ['; '.join(v) for v in txt.values()]"

grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"

print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))

yields:

Interim Dict
  Total: 59.1164s
  Avg: 591163748.5887ns

Group By
  Total: 430.6203s
  Avg: 4306203366.1827ns

where time_something is a function which times a snippet with timeit and returns the result in the above format.
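
The helper itself is not shown in this answer, so the following is only a guess at what such a function might look like, built on the standard library's `timeit.timeit`. The signature is inferred from the calls above; the body and the output format details are assumptions.

```python
import timeit

def time_something(stmt, runs, beg_string="", glbls=None):
    """Hypothetical stand-in: time `stmt` over `runs` executions and format the result."""
    # timeit.timeit returns the total time in seconds for `number` executions.
    total = timeit.timeit(stmt, number=runs, globals=glbls)
    avg_ns = total / runs * 1e9  # average per run, in nanoseconds
    return f"{beg_string}\n  Total: {total:.4f}s\n  Avg: {avg_ns:.4f}ns\n"

# Usage with a trivial statement, just to show the output shape.
report = time_something("sum(range(1000))", 10, beg_string="Example", glbls=globals())
print(report)
```

One thing to keep in mind with this style of benchmark: timeit runs the statement repeatedly in the same namespace, so a snippet that modifies `df` in place (for example with `inplace=True`) sees the already-modified frame on every run after the first. Timing against a fresh copy of the data in each run is one way to keep the comparison fair.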

score 1 · Answer 4 · answered Jan 25 '22 at 15:02

Following @jezrael's and @leoschet's answers, I would like to provide a more general example for dataframes with many more columns, something I had to do recently.

Specifically, my dataframe had a total of 184 columns.

The column REF is the groupby key. Among the remaining columns, only one, called IDS, differed between near-duplicate rows, and I wanted to collapse its elements into a single list: id1, id2, id3...

So:

# Create a dictionary {df_all_columns_name : 'first', 'IDS': join} for agg
# Also avoid REF column in dictionary (inserted after aggregation)
columns_collapse = {c: 'first' if c != 'IDS' else ', '.join for c in my_df.columns.tolist() if c != 'REF'}
my_df = my_df.groupby('REF').agg(columns_collapse).reset_index()
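
A scaled-down version of the same idea, as a runnable sketch. The four-column frame and the names `colA`, `colB` and `collapsed` are made up for illustration; only REF and IDS come from the description above.

```python
import pandas as pd

# Small stand-in for the wide dataframe: REF is the key, IDS varies.
my_df = pd.DataFrame({
    "REF": ["r1", "r1", "r2"],
    "IDS": ["id1", "id2", "id3"],
    "colA": [1, 1, 2],
    "colB": ["x", "x", "y"],
})

# 'first' for every column except the key (REF) and the one to join (IDS).
columns_collapse = {c: "first" for c in my_df.columns if c not in ("REF", "IDS")}
columns_collapse["IDS"] = ", ".join

collapsed = my_df.groupby("REF").agg(columns_collapse).reset_index()

# The result follows the dictionary's order; put the original order back.
collapsed = collapsed[list(my_df.columns)]
print(collapsed)
```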

I hope this is also useful to someone!

Regards!
