Pandas: how to select only the duplicate rows that have the same key but different values in a column

Question

I have a dataframe with columns['join key', 'code', 'A', 'B', 'C', 'D'].

The join key is a long string of characters while code can either be equal to 521, 539 or a bunch of others numbers.

There can be rows with the same join key.

I want to be able to group by join key my records and ONLY have all those duplicate records that appear in my dataframe with the code field set to both 521 and 539.

I do NOT want those records that may have the same key and code=521 but no code=539counterpart.

Once I have all the pair of duplicates I want to apply a function, compute(x) that will merge the pair into one row made of the ['A', 'B'] columns from the row with code=521 and the ['C', 'D'] columns from the row with code=539.

I can't seem to select to right rows. I keep selecting those rows that are duplicate on join key but are all of the same code with not counterpart.

import pandas as pd
exp_csv = pd.DataFrame([[3566190983, 521, 1, 2, 3, 4], \
                        [3566190983, 539, 1, 2, 3, 4], \
                        [3509730366, 521, 1, 2, 3, 4], \
                        [3509730366, 521, 1, 2, 3, 4], \
                        [3509730366, 521, 1, 2, 3, 4], \
                        [9809730686, 500, 1, 2, 3, 4]],\
                        columns=['join key', 'code', 'A', 'B', 'C', 'D'])

interim = exp_csv[exp_csv.duplicated(['join key'], keep=False)]
interim = interim.loc[(interim['code'] == 521) | (interim['code'] == 539)]
interim = interim.groupby(by='join key').apply(lambda x: compute(x))

Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. — jezrael, Jan 16 '18 at 13:00

score 2 · Answer 1 · answered Jan 16 '18 at 13:30

2

I believe you need compare sets and filter by boolean indexing:

mask = exp_csv.groupby('join key')['code'].transform(lambda x: set(x) == set([521, 539]))
interim = exp_csv[mask]
print (interim)
     join key  code  A  B  C  D
0  3566190983   521  1  2  3  4
1  3566190983   539  1  2  3  4

answered Jan 16 '18 at 13:30

jezrael

822,522
95
1,334
1,252

Thank you, that does indeed work! But how could I also add up the two rows? Maybe I could groupby again by key and the apply a function on each pair? – Indaco789 Jan 16 '18 at 13:50
Is there a way to combine two rows in the way I explained in the OP? I would like to have a new row with only columns`['join key', 'code', 'A', 'B']` from one row and only columns `['C', 'D']` from the other row. I thought about setting some columns to `nan` and then summing the two rows but I can't seem to do it. – Indaco789 Jan 16 '18 at 14:03

Pandas: how to select only the duplicate rows that have the same key but different values in a column

1 Answers1