finding duplicates and adding ID as attribute pandas

Question

I'm working in geopandas with a large number (around 4.5 million) objects, where each has a unique ID number ('PARCEL_SPI') and also another code ('PC_PLANNO').

What I would like to do is write some code that, for each object, finds all other objects with the same PLANNO and adds their ID number as a list in a new attribute, say 'Same_code' for the object. The df is called spine_copy.

Here's a quick sample of what I have:

PARCEL_SPI	PC_PLANNO
23908	LP12345
90435	LP12345
329048	LP90803
6409	LP2399
34534	LP90803
092824	LP12345

and what I want out:

PARCEL_SPI	PC_PLANNO	Same_code
23908	LP12345	[90435, 092824]
90435	LP12345	[23908,092824]
329048	LP90803	34534
6409	LP2399	None
34534	LP90803	329048
092824	LP12345	[23908, 90435]

I'm not too sure how to do this, but here's my attempt using groupby:

spine_copy.groupby('PC_PLANNO')['PARCEL_SPI'].apply(list)

However, this doesn't add the list as a new attribute for each object, and I'm unsure how to do this.

Thanks in advance!

Unfortunately needs to be a new attribute - the output for this project is basically a large database with attributes for each object — lev72748, Oct 20 '21 at 11:55

jezrael · Accepted Answer · 2021-10-20T12:09:36.297

2

Here converting to list is not necessary - filter duplciated rows by Series.duplicated and for it use GroupBy.transform with invert mask passed to numpy.where:

m = spine_copy['PC_PLANNO'].duplicated(keep=False)
s = spine_copy.groupby('PC_PLANNO')['PARCEL_SPI'].transform(lambda x: x.to_numpy()[::-1])
spine_copy['Same_code'] = np.where(m, s, None)
print (spine_copy)
   PARCEL_SPI PC_PLANNO Same_code
0       23908   LP12345     90435
1       90435   LP12345     23908
2      329048   LP90803     34534
3        6409    LP2399      None
4       34534   LP90803    329048

EDIT: with new data:

m = spine_copy['PC_PLANNO'].duplicated(keep=False)

new = spine_copy.groupby('PC_PLANNO')['PARCEL_SPI'].apply(list).rename('Same_code')
vals = spine_copy.join(new, on='PC_PLANNO')[['PARCEL_SPI','Same_code']]
s = [[z for z in y if z != x] for x, y in vals.to_numpy()]

spine_copy['Same_code'] = np.where(m, s, None)
print (spine_copy)
   PARCEL_SPI PC_PLANNO       Same_code
0       23908   LP12345  [90435, 92824]
1       90435   LP12345  [23908, 92824]
2      329048   LP90803         [34534]
3        6409    LP2399            None
4       34534   LP90803        [329048]
5       92824   LP12345  [23908, 90435]

edited Oct 20 '21 at 12:09

answered Oct 20 '21 at 07:08

jezrael

822,522
95
1,334
1,252

Seen you prefer to use `.to_numpy()` instead of `.values`. Any reason for that ? Just curious. – SeaBean Oct 20 '21 at 08:05
@SeaBean - sure, because [this](https://stackoverflow.com/a/54508052/2901002) – jezrael Oct 20 '21 at 08:06
1

Thanks for pointing to the reason. Thank you very much :-) – SeaBean Oct 20 '21 at 08:07
Thank you @jezrael! Just curious - will this also work if there are multiple with the same code? In some instances there are several dozen – lev72748 Oct 20 '21 at 11:54
@lev72748 - Not sure if understand, my code only invert order. – jezrael Oct 20 '21 at 11:56
@jezrael I mean, in the actual data there will be many objects that have the same PC_PLANNO, so some objects will need to have multiple PARCEL_SPIs appearing in the Same_code column – lev72748 Oct 20 '21 at 11:57
@lev72748 - Without sample data + expected ouput I have no idea what you think. Unfortuantely. – jezrael Oct 20 '21 at 11:58
@jezrael I've edited the question to show what I mean: consider the PC_PLANNO LP12345 - since there are 3 objects with the same value, there would be 2 entries appearing in the Same_code column (I've shown this as a list in the desired output) – lev72748 Oct 20 '21 at 12:00
@lev72748 - Give me some time for nswer with nw data – jezrael Oct 20 '21 at 12:03
@lev72748 - added new solution to answer. – jezrael Oct 20 '21 at 12:16
1

@jezrael thanks so much, great solution! – lev72748 Oct 21 '21 at 22:45

score 1 · Answer 2 · answered Oct 20 '21 at 07:03

May be you can try:

other = df.groupby('PC_PLANNO')['PARCEL_SPI'].apply(lambda x: x.tolist()).reset_index()
df = df.merge(other.rename(columns={'PARCEL_SPI':'Same_code'}), how='left', on=['PC_PLANNO'])
df['Same_code'] = df[['PARCEL_SPI', 'Same_code']].apply(lambda x: list(set(x['Same_code']) - set([x['PARCEL_SPI']])), axis=1)

OUTPUT:

   PARCEL_SPI PC_PLANNO Same_code
0       23908   LP12345   [90435]
1       90435   LP12345   [23908]
2      329048   LP90803   [34534]
3        6409    LP2399        []
4       34534   LP90803  [329048]

finding duplicates and adding ID as attribute pandas

2 Answers2