Problem
How can I remove duplicate cells from each row of a Pandas dataframe, considering each row separately (and perhaps replace them with NaNs)?
It would be even better if we could shift all newly created NaNs to the end of each row.
Related but different posts
Posts on how to remove entire rows that are deemed duplicates:
- how do I remove rows with duplicate values of columns in pandas data frame?
- Drop all duplicate rows across multiple columns in Python Pandas
- Remove duplicate rows from Pandas dataframe where only some columns have the same value
Post on how to remove duplicates from a list stored in a Pandas column:
The answer given here returns a Series of strings, not a dataframe.
Reproducible setup
import pandas as pd
Let's create a dataframe:
df = pd.DataFrame({'a': ['A', 'A', 'C', 'B'],
                   'b': ['B', 'D', 'B', 'B'],
                   'c': ['C', 'C', 'C', 'A'],
                   'd': ['D', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])
df
which displays:
+----+-----+-----+-----+-----+
| | a | b | c | d |
|----+-----+-----+-----+-----|
| 0 | A | B | C | D |
| 1 | A | D | C | D |
| 2 | C | B | C | B |
| 3 | B | B | A | A |
+----+-----+-----+-----+-----+
(Printed using this.)
A solution
One way of dropping duplicates from each row, considering each row separately:
df = df.apply(lambda row: row.drop_duplicates(keep='first'), axis='columns')
using apply() with a lambda and Series.drop_duplicates(). Each row is already passed to the lambda as a Series, so it can be deduplicated directly; apply() then realigns the shortened rows by column label, leaving NaNs where duplicates were dropped.
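For reference, after this step the dropped duplicates should become NaNs in place (worked out by hand from the dataframe above, before any shifting):
+----+-----+-----+-----+-----+
|    | a   | b   | c   | d   |
|----+-----+-----+-----+-----|
|  0 | A   | B   | C   | D   |
|  1 | A   | D   | C   | nan |
|  2 | C   | B   | nan | nan |
|  3 | B   | nan | A   | nan |
+----+-----+-----+-----+-----+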
Then shove all NaNs to the end of each row, using the approach from Shift NaNs to the end of their respective rows:
df.apply(lambda x: pd.Series(x[x.notnull()].tolist() + x[x.isnull()].tolist()), axis='columns')
Output:
+----+-----+-----+-----+-----+
| | 0 | 1 | 2 | 3 |
|----+-----+-----+-----+-----|
| 0 | A | B | C | D |
| 1 | A | D | C | nan |
| 2 | C | B | nan | nan |
| 3 | B | A | nan | nan |
+----+-----+-----+-----+-----+
Just as we wished.
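For reference, the two steps can also be chained into a single expression. This is just the same code as above combined, assuming df is as defined in the setup (deduped is a name chosen here for illustration):
# The two steps above, chained: drop in-row duplicates, then pack non-NaN values to the left
deduped = (df.apply(lambda row: row.drop_duplicates(keep='first'), axis='columns')
             .apply(lambda x: pd.Series(x[x.notnull()].tolist() + x[x.isnull()].tolist()),
                    axis='columns'))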
Question
Is there a more efficient way to do this? Perhaps with some built-in Pandas functions?