I have a dataframe with columns ['join key', 'code', 'A', 'B', 'C', 'D']. The join key is a long string of characters, while code can be equal to 521, 539, or a number of other values. There can be several rows with the same join key.

I want to group my records by join key and keep ONLY those duplicate records that appear in my dataframe with the code field set to both 521 and 539. I do NOT want the records that share the same key but only have code=521 rows with no code=539 counterpart.
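
To make the condition concrete, here is a minimal sketch of the selection I am after (has_both_codes is just my own name for this hypothetical helper, and I don't know if filter is the right tool for it):

import pandas as pd

# Hypothetical helper: keep only the join keys whose group of rows
# contains both code values, 521 and 539.
def has_both_codes(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby('join key').filter(
        lambda g: {521, 539}.issubset(set(g['code'])))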
Once I have all the pairs of duplicates, I want to apply a function, compute(x), that merges each pair into one row made of the ['A', 'B'] columns from the row with code=521 and the ['C', 'D'] columns from the row with code=539.
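
For reference, a simplified sketch of the merge compute(x) is meant to perform, assuming each group holds exactly one code=521 row and one code=539 row:

import pandas as pd

# Simplified sketch of compute: 'A'/'B' come from the code=521 row,
# 'C'/'D' from the code=539 row of the same join key.
def compute(group: pd.DataFrame) -> pd.Series:
    r521 = group.loc[group['code'] == 521].iloc[0]
    r539 = group.loc[group['code'] == 539].iloc[0]
    return pd.Series({'A': r521['A'], 'B': r521['B'],
                      'C': r539['C'], 'D': r539['D']})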
I can't seem to select the right rows. I keep selecting rows that are duplicated on join key but all share the same code, with no counterpart. Here is my attempt:
import pandas as pd

exp_csv = pd.DataFrame(
    [[3566190983, 521, 1, 2, 3, 4],
     [3566190983, 539, 1, 2, 3, 4],
     [3509730366, 521, 1, 2, 3, 4],
     [3509730366, 521, 1, 2, 3, 4],
     [3509730366, 521, 1, 2, 3, 4],
     [9809730686, 500, 1, 2, 3, 4]],
    columns=['join key', 'code', 'A', 'B', 'C', 'D'])

# Keep every row whose join key appears more than once
interim = exp_csv[exp_csv.duplicated(['join key'], keep=False)]
# Keep only the rows with code 521 or 539 -- but this still lets through
# keys that have several code=521 rows and no code=539 counterpart
interim = interim.loc[(interim['code'] == 521) | (interim['code'] == 539)]
interim = interim.groupby(by='join key').apply(lambda x: compute(x))
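
Pulling the sketches together, this is the end-to-end behavior I am after on the exp_csv example (again relying on my hypothetical has_both_codes and the simplified compute from above):

# Keep only keys that have both codes, restrict to the 521/539 rows,
# then merge each surviving pair with compute.
pairs = has_both_codes(exp_csv)
pairs = pairs[pairs['code'].isin([521, 539])]
result = pairs.groupby('join key').apply(compute)
# Expected: a single merged row for join key 3566190983.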