1

I would like to delete the second matching for each row. Column X1 is the column that we will be matching against, it's always the reference, we don't delete values from X1

Example (starting point) DataFrame df_client:

| Index |Name   |email             |city   |X1 |X2 |X3 |X4 |X5 |
---     |---    |---               |---    |---|---|---|---|---|
| 0     |Mary   |Mary@hotmail.com  |London |AB1|KD2|AB1|   |CM2|
| 1     |john   |john@hotmail.com  |Tokyo  |LK2|LK2|   |IG5|   |
| 2     |karl   |karl@hotmail.com  |London |MK6|   |MK6|   |
| 3     |jasmin |jasmin@hotmail.com|Toronto|UH5|FG6|UH5|   |   |
| 4     |Frank  |Frank@hotmail.com |Paris  |PO4|   |   |PO4|
| 5     |lee    |lee@hotmail.com   |Madrid |RT3|RT3|WS1|   |   |

I would like to compare the values X2,X3,X4,X5 always to X1 and that for each row.

When we find a matching value (e.g. row 0 I would like to delete AB1 from X3). In other words, we always keep the value in X1 and delete the matching value from X2 or X3 or X4 or X5.

I would like to add that it's a guarantee that each row will have a value in X2 or X3 or X4 or X5 that matches a value in X1:

The desired result will look like this:

|Index|Name  |email             |city   |X1  |X2|X3 |X4 |X5 |
 ---  |---   |---               |    ---|---|---|---|---|---
| 0   |Mary  |Mary@hotmail.com  |London |AB1|KD2|   |   |CM2|
| 1   |john  |john@hotmail.com  |Tokyo  |LK2|   |   |IG5|   |
| 2   |karl  |karl@hotmail.com  |London |MK6|   |   |   |   |
| 3   |jasmin|jasmin@hotmail.com|Toronto|UH5|FG6|   |   |   |
| 4   |Frank |Frank@hotmail.com |Paris  |PO4|   |   |   |   |
| 5   |lee   |lee@hotmail.com   |Madrid |RT3|WS1|   |   |

It's not important but ideally, I would like to be able to move the values to the left if there are empty cells; something like this :

|Index|Name  |email             |city   |X1 |X2 |X3 |X4 |X5 |
 ---  |---   |               ---|---    |---|---|---|---|---
| 0   |Mary  |Mary@hotmail.com  |London |AB1|KD2|CM2|   |   |
| 1   |john  |john@hotmail.com  |Tokyo  |LK2|IG5|   |   |   |
| 2   |karl  |karl@hotmail.com  |London |MK6|   |   |   |   |
| 3   |jasmin|jasmin@hotmail.com|Toronto|UH5|FG6|   |   |   |
| 4   |Frank |Frank@hotmail.com |Paris  |PO4|   |   |   |   |
| 5   |lee   |lee@hotmail.com   |Madrid |RT3|WS1|   |   |   |

Moving the values to the left is really not important, if you can help me with just deleting the matching values that will be more than enough.

Thank you

PyD
  • 25
  • 4
  • Maybe some of these answers can help you: https://stackoverflow.com/questions/63583502/removing-duplicates-from-pandas-rows-replace-them-with-nans-shift-nans-to-end – nolan Mar 10 '22 at 01:40

1 Answers1

1

You can do it with apply, a lambda function, and drop duplicates on axis=1 which operates on rows. This shifts the columns around, but you can store the order beforehand and reassign when you're done.

df = pd.DataFrame({'Name': {0: 'Mary', 1: 'john', 2: 'karl', 3: 'jasmin', 4: 'Frank', 5: 'lee'},
 'email': {0: 'Mary@hotmail.com',
  1: 'john@hotmail.com',
  2: 'karl@hotmail.com',
  3: 'jasmin@hotmail.com',
  4: 'Frank@hotmail.com',
  5: 'lee@hotmail.com'},
 'city': {0: 'London',
  1: 'Tokyo',
  2: 'London',
  3: 'Toronto',
  4: 'Paris',
  5: 'Madrid'},
 'X1': {0: 'AB1', 1: 'LK2', 2: 'MK6', 3: 'FG6', 4: 'PO4', 5: 'RT3'},
 'X2': {0: 'KD2', 1: 'LK2', 2: 'MK6', 3: 'UH5', 4: 'PO4', 5: 'RT3'},
 'X3': {0: 'AB1', 1: 'IG5', 2: None, 3: None, 4: None, 5: 'WS1'},
 'X4': {0: 'CM2', 1: None, 2: None, 3: None, 4: None, 5: None},
 'X5': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan, 5: np.nan}})

col_order = df.columns
df = df.apply(lambda x: x.drop_duplicates(keep='first'), axis=1)
df = df[col_order]

Output df:

    Name    email               city    X1  X2   X3     X4      X5
0   Mary    Mary@hotmail.com    London  AB1 KD2  NaN    CM2     NaN
1   john    john@hotmail.com    Tokyo   LK2 NaN  IG5    None    NaN
2   karl    karl@hotmail.com    London  MK6 NaN  None   NaN     NaN
3   jasmin  jasmin@hotmail.com  Toronto UH5 FG6  None   None    NaN
4   Frank   Frank@hotmail.com   Paris   PO4 NaN  None   NaN     NaN
5   lee     lee@hotmail.com     Madrid  RT3 NaN  WS1    None    NaN

You can do this if you want to shift the data over to the left. You will need to change the indexing on the last line to match how many columns you have left after some get removed when creating shift_vals.

shift_vals = df.apply(lambda x: x.dropna().tolist(), axis=1)
new_df = pd.DataFrame(shift_vals.to_list())
new_df.columns = df.columns[0:-2]
Matthew Borish
  • 3,016
  • 2
  • 13
  • 25