7

This is my dataframe:

import pandas as pd

df = pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
                   'B':[0, 2, 3, 4, 5, 6, 7],
                   'C':[[1,4,4,4], [1,4,4,4], [3,4,4,5], [3,4,4,5], [4,4,2,1], [1,2,3,4,], [7,8,9,1]]})

I want to drop the duplicate values inside each list in column C (i.e. get the set of each row's list), without dropping duplicate rows.

This is what I hope to get:

pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
              'B':[0, 2, 3, 4, 5, 6, 7],
              'C':[[1,4], [1,4], [3,4,5], [3,4,5], [4,2,1], [1,2,3,4,], [7,8,9,1]]})
Liam
matan

4 Answers

9

If you're using Python 3.7 or later, you could map with dict.fromkeys and build a list from the dictionary keys (the version matters because dicts are guaranteed to preserve insertion order from 3.7 onward):

df['C'] = df.C.map(lambda x: list(dict.fromkeys(x).keys()))

For older Python versions you have collections.OrderedDict:

from collections import OrderedDict
df['C'] = df.C.map(lambda x: list(OrderedDict.fromkeys(x).keys()))

print(df)

   A  B             C
0  1  0        [1, 4]
1  3  2        [1, 4]
2  3  3     [3, 4, 5]
3  4  4     [3, 4, 5]
4  5  5     [4, 2, 1]
5  3  6  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]
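
To make the idea above easier to reuse, here is a minimal sketch that wraps it in a helper. The name `dedupe_keep_order` and the two-row frame are made up for illustration; the only assumption is that the list items are hashable. Iterating a dict already yields its keys, so the explicit `.keys()` call can be dropped.

```python
import pandas as pd

def dedupe_keep_order(values):
    """Return the distinct items of `values` in first-seen order (items must be hashable)."""
    # dict.fromkeys keeps one key per distinct item; dicts preserve insertion order on Python 3.7+
    return list(dict.fromkeys(values))

# Small illustrative frame, not the one from the question
df = pd.DataFrame({'A': [1, 5], 'C': [[1, 4, 4, 4], [4, 4, 2, 1]]})
df['C'] = df['C'].map(dedupe_keep_order)
result = df['C'].tolist()
```

Passing a named function to `map` instead of a lambda also makes the step easier to test on its own.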

As mentioned by cs95 in the comments, if we don't need to preserve order we could go with a set for a more concise approach:

df['C'] = df.C.map(lambda x: [*{*x}])
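
One thing to keep in mind with the set version is that the order of the resulting list is whatever the set happens to iterate in. If a predictable order is wanted and the items are comparable, sorting the set is one option. This is just a sketch on a made-up two-row frame:

```python
import pandas as pd

# Small illustrative frame, not the one from the question
df = pd.DataFrame({'C': [[1, 4, 4, 4], [4, 4, 2, 1]]})

# Set unpacking: duplicates removed, order not guaranteed
unordered = df['C'].map(lambda x: [*{*x}])

# Sorting the set gives a deterministic (ascending) order instead
sorted_unique = df['C'].map(lambda x: sorted(set(x)))
```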

Since several approaches have been proposed and it is hard to tell how they will perform on large dataframes, it is probably worth benchmarking them:

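import numpy as np
import perfplot

# Repeat the 7-row frame 50,000 times to get a large benchmark input
df = pd.concat([df]*50000, axis=0).reset_index(drop=True)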

perfplot.show(
    setup=lambda n: df.iloc[:int(n)], 

    kernels=[
        lambda df: df.C.map(lambda x: list(dict.fromkeys(x).keys())),
        lambda df: df['C'].map(lambda x: pd.factorize(x)[1]),
        lambda df: [np.unique(item) for item in df['C'].values],
        lambda df: df['C'].explode().groupby(level=0).unique(),
        lambda df: df.C.map(lambda x: [*{*x}]),
    ],

    labels=['dict.fromkeys', 'factorize', 'np.unique', 'explode', 'set'],
    n_range=[2**k for k in range(0, 18)],
    xlabel='N',
    equality_check=None
)
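
Because `equality_check=None` turns off perfplot's comparison of the kernel outputs (they differ in order and in container type), it may be worth confirming separately that the approaches agree up to ordering. A rough sketch of such a check on a small made-up frame follows; `as_sets` and `candidates` are names introduced here, and the factorize kernel wraps each list in `np.asarray`, which I believe avoids a warning that newer pandas versions emit for plain lists.

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the benchmark input
df = pd.DataFrame({'C': [[1, 4, 4, 4], [3, 4, 4, 5], [4, 4, 2, 1]]})

candidates = {
    'dict.fromkeys': df['C'].map(lambda x: list(dict.fromkeys(x))),
    'factorize': df['C'].map(lambda x: pd.factorize(np.asarray(x))[1]),
    'np.unique': pd.Series([np.unique(item) for item in df['C'].values]),
    'explode': df['C'].explode().groupby(level=0).unique(),
    'set': df['C'].map(lambda x: [*{*x}]),
}

def as_sets(series):
    """Normalise each row to a set of plain ints so order and container type are ignored."""
    return [set(int(v) for v in row) for row in series]

reference = as_sets(candidates['dict.fromkeys'])
agreement = {name: as_sets(result) == reference for name, result in candidates.items()}
```

Every value in `agreement` should be `True` if the kernels remove the same duplicates.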

[Image: perfplot benchmark plot of the five kernels, runtime against N]

yatu
  • 1
    The obvious solution is this. If order doesn't matter then `df.C.map(lambda x: [*{*x}])` is way shorter. May be worth mentioning to use `OrderedDict.fromkeys` on older versions to maintain order. – cs95 Jul 13 '20 at 09:07
  • Yes, indeed worth mentioning, added. Like the unpacking one! might be worth adding too, aside from the more general one which does preserve order @cs95 – yatu Jul 13 '20 at 09:24
  • 1
    interesting how the numpy solution performs at smaller volumes. the list/set unpacking is very clever @cs95 – Umar.H Jul 13 '20 at 19:39
8

If order is not important, you could take the column's underlying NumPy array and apply np.unique to each row in a list comprehension. Note that np.unique returns the values sorted.

import numpy as np
df['C_Unique'] = [np.unique(item) for item in df['C'].values]

print(df)

   A  B             C      C_Unique
0  1  0  [1, 4, 4, 4]        [1, 4]
1  3  2  [1, 4, 4, 4]        [1, 4]
2  3  3  [3, 4, 4, 5]     [3, 4, 5]
3  4  4  [3, 4, 4, 5]     [3, 4, 5]
4  5  5  [4, 4, 2, 1]     [1, 2, 4]
5  3  6  [1, 2, 3, 4]  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]  [1, 7, 8, 9]
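
Note that each cell produced this way holds a NumPy array, not a Python list. If plain lists are needed downstream, calling `.tolist()` on each result is one way to get them. A small sketch on a made-up two-row frame:

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the one from the question
df = pd.DataFrame({'C': [[1, 4, 4, 4], [4, 4, 2, 1]]})

# np.unique returns a sorted ndarray; .tolist() converts it to a list of Python ints
df['C_Unique'] = df['C'].map(lambda item: np.unique(item).tolist())
```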

Another method would be to use explode and groupby.unique, which keeps the values in their original order:

df['CExplode'] = df['C'].explode().groupby(level=0).unique()

   A  B             C      C_Unique      CExplode
0  1  0        [1, 4]        [1, 4]        [1, 4]
1  3  2        [1, 4]        [1, 4]        [1, 4]
2  3  3     [3, 4, 5]     [3, 4, 5]     [3, 4, 5]
3  4  4     [3, 4, 5]     [3, 4, 5]     [3, 4, 5]
4  5  5     [4, 2, 1]     [1, 2, 4]     [4, 2, 1]
5  3  6  [1, 2, 3, 4]  [1, 2, 3, 4]  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]  [1, 7, 8, 9]  [7, 8, 9, 1]
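
To see why grouping on `level=0` works here: `explode` puts every list element on its own row but repeats the original index label, so the index identifies which row each element came from. A short sketch on a made-up Series (this assumes pandas 0.25 or later, where `Series.explode` was added; the trailing `.map(list)` is optional and only converts the arrays to lists):

```python
import pandas as pd

# Small illustrative Series, not the column from the question
s = pd.Series([[1, 4, 4, 4], [4, 4, 2, 1]])

# One row per list element; the original index label is repeated for each element
exploded = s.explode()

# Group by that repeated label and keep the distinct values in first-seen order
unique_lists = exploded.groupby(level=0).unique().map(list)
```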
Umar.H
3

You can use the apply function in pandas with set (the order of the resulting list is not guaranteed):

df['C'] = df['C'].apply(lambda x: list(set(x)))
Ashok Krishna
2

map and factorize

Let's throw one more into the mix.

df['C'].map(pd.factorize).str[1]

0          [1, 4]
1          [1, 4]
2       [3, 4, 5]
3       [3, 4, 5]
4       [4, 2, 1]
5    [1, 2, 3, 4]
6    [7, 8, 9, 1]
Name: C, dtype: object

Or,

df['C'].map(lambda x: pd.factorize(x)[1])

0          [1, 4]
1          [1, 4]
2       [3, 4, 5]
3       [3, 4, 5]
4       [4, 2, 1]
5    [1, 2, 3, 4]
6    [7, 8, 9, 1]
Name: C, dtype: object
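
For anyone wondering about the `[1]`: `pd.factorize` returns a tuple of `(codes, uniques)`, where `uniques` holds the distinct values in order of first appearance and `codes` gives each original element's position in `uniques`. A minimal sketch on a single made-up array (the input is wrapped in `np.asarray` here, which I believe avoids a warning that newer pandas versions emit for plain lists):

```python
import numpy as np
import pandas as pd

# One illustrative row, converted to an ndarray first
values = np.asarray([4, 4, 2, 1])

codes, uniques = pd.factorize(values)

# Indexing uniques with codes reconstructs the original values
rebuilt = uniques[codes]
```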
cs95