How do I remove all the duplicates in each column respectively?

Question

I have multiple columns in a dataframe.

Some have duplicates, others are completely empty.

I need to delete the duplicates within each column.

dict_A = {'Firm_A': ['A', 'A', 'B','C'],'Firm_B': [], 'Firm_C': ['B', 'A', 'B','D'], 'Firm_D': ['C', 'C', 'A','D']}
df = pd.DataFrame.from_dict(dict_A, orient = 'index')
df = df.T
df

  Firm_A Firm_B Firm_C  Firm_D
0   A     None    C       C
1   A     None    A       C
2   B     None    B       A
3   C     None    D       A

Desired Output:

  Firm_A Firm_B Firm_C  Firm_D
0   A     None    C       C
1   B     None    A       A
2   C     None    B       None
3  None   None    D       None

So far I have tried drop.duplicates().

I have around 3000 columns with each column having around 100 values.

score 0 · Accepted Answer · answered Jul 30 '19 at 22:32

Here is one way using duplicated

df.mask(df.apply(pd.Series.duplicated,0)).apply(lambda x : sorted(x,key=pd.isnull))
Out[542]: 
  Firm_A Firm_B Firm_C Firm_D
0      A   None      C      C
1      B    NaN      A      A
2      C    NaN      B    NaN
3    NaN    NaN      D    NaN

Also if you want to speed up check justify

score 0 · Answer 2 · answered Jul 30 '19 at 22:33

0

You're doing this with the wrong data structure. If the entries in a data frame row are not related to one another, then a df is likely not an applicable data structure. Instead, de-dup your dictionary values:

DictA = {k: list(set(v)) for k, v in DictA.items()}

After that, if you happen to have some strange application for data frame columns, you can build the df from there.

answered Jul 30 '19 at 22:33

Prune

76,765
14
60
81

1

You may want to add sorted `{k: list(sorted(set(v),key=v.index)) for k, v in dict_A.items()} ` – BENY Jul 30 '19 at 22:39
Why add sorted? OP's example assumes stability in the de-dup process, which I've ignored (left for the student) in my example. `sorted` adds work to the instability, neh? – Prune Jul 30 '19 at 23:39
I think set will order the list . at least on my side can you check the output ? – BENY Jul 31 '19 at 00:42

Andy L. · Answer 3 · 2019-07-30T23:12:41.877

Use stack to convert df to series. Next, groupby on level=1 and call unique to construct lists of unique values per column. Finally, construct a new df from unique and transpose

s = df.stack().groupby(level=1).unique()

Out[279]:
Firm_A       [A, B, C]
Firm_B          [None]
Firm_C    [C, A, B, D]
Firm_D          [C, A]
dtype: object


pd.DataFrame(s.tolist(), index=df.columns).T.reindex(df.index)

Out[280]:
  Firm_A Firm_B Firm_C Firm_D
0      A   None      C      C
1      B   None      A      A
2      C   None      B   None
3   None   None      D   None

How do I remove all the duplicates in each column respectively?

3 Answers3