Pandas: How can I remove duplicates with single column different values, while retaining said different values

Question

I have seen this question and I cannot understand what it means. It has something to do with flattening the multiple indexes into a single index after pivoting the dataframe.

I am currently working on image processing and I have a dataframe that has duplicate rows, however, each duplicate row has a different value for the 'subject'.

The Goal:

I want to remove these duplicate rows even though they have a different value, and join all the different values from all the duplicates into a single column i.e. Aortic enlargement|Pulmonary fibrosis|Atelectasis.

The Question:

This is essentially a multipart question.

How could I achieve the goal stated above?
Could somebody explain in lamen terms the question I have linked to fully understand it?

Extra Information:

I have a mock csv file that you can access if you need to fully understand what I mean.

Which rows are duplicated??? With which columns do you consider it's a duplicate row? Maybe (16, 57) and (35, 65) on columns right to `class_name` — Corralien, Sep 14 '21 at 06:50
in the `image_id` column I have duplicated row. I have 15k Images but the rows are 67k+. Each image id have a unique disease in `label` col. So I want to remove all those duplication and make the label like: img_id label 0001 fibrosis | Atelectasis I want like this — Engr Ali, Sep 14 '21 at 06:54
Another question: other columns should be dropped? The output contains only 2 columns: `image_id` and `class_name`, that's right? — Corralien, Sep 14 '21 at 06:59

Corralien · Accepted Answer · 2021-09-14T07:21:45.220

Consider this dataframe as MRE:

>>> df
                           image_id          class_name
0  47ed17dcb2cbeec15182ed335a8b5a9e         Nodule/Mass  # <- dup 1
1  47ed17dcb2cbeec15182ed335a8b5a9e  Aortic enlargement  # <- dup 1
2  47ed17dcb2cbeec15182ed335a8b5a9e  Pulmonary fibrosis  # <- dup 1
3  7c1add6833d5f0102b0d3619a1682a64        Lung Opacity  # <- dup 2
4  7c1add6833d5f0102b0d3619a1682a64  Pulmonary fibrosis  # <- dup 2
5  5550a493b1c4554da469a072fdfab974          No finding  # <- dup 3
6  5550a493b1c4554da469a072fdfab974          No finding  # <- dup 3

To get expect outcome, you need to group rows by image_id and join all values from class_name together and separated by ' | ':

>>> df.groupby('image_id')['class_name'].apply(lambda x: ' | '.join(sorted(set(x))))

image_id
47ed17dcb2cbeec15182ed335a8b5a9e    Aortic enlargement | Nodule/Mass | Pulmonary f...
5550a493b1c4554da469a072fdfab974                                           No finding
7c1add6833d5f0102b0d3619a1682a64                    Lung Opacity | Pulmonary fibrosis

Use set to remove class_name duplicates for a same image_id and sorted to get class_name lexicographical ordered.

Update

You can use MultiIndex to show correctly your duplicated rows. Try:

>>> df.set_index(['image_id', 'class_name']).sort_index()

                                             class_id rad_id  x_min  y_min  x_max  y_max  width  height
image_id                         class_name
000434271f63a053c4128a0ba6352c7f No finding        14     R6    NaN    NaN    NaN    NaN   2336    2836
                                 No finding        14     R2    NaN    NaN    NaN    NaN   2336    2836
                                 No finding        14     R3    NaN    NaN    NaN    NaN   2336    2836
00053190460d56c53cc3e57321387478 No finding        14    R11    NaN    NaN    NaN    NaN   1994    2430
                                 No finding        14     R2    NaN    NaN    NaN    NaN   1994    2430
...                                               ...    ...    ...    ...    ...    ...    ...     ...
fff0f82159f9083f3dd1f8967fc54f6a No finding        14     R9    NaN    NaN    NaN    NaN   2048    2500
                                 No finding        14    R14    NaN    NaN    NaN    NaN   2048    2500
fff2025e3c1d6970a8a6ee0404ac6940 No finding        14     R1    NaN    NaN    NaN    NaN   1994    2150
                                 No finding        14     R5    NaN    NaN    NaN    NaN   1994    2150
                                 No finding        14     R2    NaN    NaN    NaN    NaN   1994    2150

[67914 rows x 8 columns]

On dummy data because on your csv file, there is no `image_id` dups. 70 rows in input and 70 rows on output. — Corralien, Sep 14 '21 at 07:10
I am not getting the exact output. I update the file can you check it again? — Engr Ali, Sep 14 '21 at 07:15
With your new data: in: 67914 records -> out: 15000 records. — Corralien, Sep 14 '21 at 07:18
I will brother but as I mentioned I am not expert in pandas so still not getting my required output ? Any Reason did you get as I want? — Engr Ali, Sep 14 '21 at 07:28
What should be the shape of your output? 15000 or 67914 rows? 1, 2, 10, 11 columns? Please mock an output only with few rows and cols. — Corralien, Sep 14 '21 at 07:33
15000 without duplicates just merge the duplicate `image id` and a new column where the data is like `Aortic enlargement | Nodule/Mass | Pulmonary`. That it. — Engr Ali, Sep 14 '21 at 07:35
Like that: `out = df.groupby('image_id')['class_name'].apply(lambda x: ' | '.join(sorted(set(x)))).reset_index()` — Corralien, Sep 14 '21 at 07:42

Pandas: How can I remove duplicates with single column different values, while retaining said different values

1 Answers1