Remove duplicates from rows and columns (cell) in a dataframe, python

Question

I have two columns with a lot of duplicated items per cell in a dataframe. Something similar to this:

Index   x    y  
  1     1    ec, us, us, gbr, lst
  2     5    ec, us, us, us, us, ec, ec, ec, ec
  3     8    ec, us, us, gbr, lst, lst, lst, lst, gbr
  4     5    ec, ec, ec, us, us, ir, us, ec, ir, ec, ec
  5     7    chn, chn, chn, ec, ec, us, us, gbr, lst

I need to eliminate all the duplicate items and get a resulting dataframe like this:

Index   x    y  
  1     1    ec, us, gbr, lst
  2     5    ec, us
  3     8    ec, us, gbr, lst
  4     5    ec, us, ir
  5     7    chn, ec, us, gbr, lst

Thanks!!

So, what did you already try out in order to get the result you want? — 1313e, Jan 04 '18 at 04:30
https://stackoverflow.com/questions/7794208/how-can-i-remove-duplicate-words-in-a-string-with-python There are multiple functions there; you just need to apply them to your dataframe. — BENY, Jan 04 '18 at 04:55

score 20 · Accepted Answer · edited Sep 15 '21 at 09:40

Split each string, apply set to drop the duplicates, then join the items back together, i.e.:

df['y'].str.split(', ').apply(set).str.join(', ')

0         us, ec, gbr, lst
1                   us, ec
2         us, ec, gbr, lst
3               us, ec, ir
4    us, lst, ec, gbr, chn
Name: y, dtype: object

Update based on comment:

# Remove the braces, whitespace and `nan`, split on ',', apply set and join,
# then strip leading/trailing commas and collapse repeated ones
df['y'].str.replace(r'nan|[{}\s]', '', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(r',{2,}', ',', regex=True)
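As an alternative to the regex chain, the same cleanup can be sketched at the token level, which avoids leftover commas and does not touch `nan` when it appears inside a longer item. This is only a sketch: the sample data and the helper name `clean_cell` are hypothetical, and it assumes each cell is a string shaped like `{ec, us, ..., nan}` as described in the comments.

```python
import pandas as pd

# Hypothetical sample mirroring the "{ec, us, ..., nan}" format from the comments
df = pd.DataFrame({
    "x": [1, 5],
    "y": ["{ec, us, us, gbr, lst, nan, nan}", "{nan, ec, us, us, ec}"],
})

def clean_cell(cell):
    """Strip braces, drop 'nan' and empty tokens, and remove duplicates."""
    if not isinstance(cell, str):
        return cell  # leave non-string cells (e.g. a real float NaN) untouched
    tokens = (t.strip() for t in cell.strip().strip("{}").split(","))
    unique = {t for t in tokens if t and t != "nan"}
    return ", ".join(unique)  # set order is arbitrary

df["y"] = df["y"].apply(clean_cell)
```

Because the items pass through a set, their order in each cell may differ from the input.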

edited Sep 15 '21 at 09:40

kağan hazal koçdemir

answered Jan 04 '18 at 04:34

Bharath M Shetty

it works perfect @Dark ... but I forgot to include that all the [y] column is like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it? – PAstudilloE Jan 04 '18 at 05:11
@PAstudilloE are you saying the y column is like {ec,us.. before running this code or after running this code? – Bharath M Shetty Jan 04 '18 at 05:15
before running the code. the original columns are {ec, us, ..., nan} @Dark – PAstudilloE Jan 04 '18 at 05:18
it works well. The only problem that I have now is that the results I'm getting are like this: , , us, ec... (the nan's are erased but the commas are still there). Do you have any guidance on how to solve that? – PAstudilloE Jan 04 '18 at 05:49
For FutureWarning error add regex=True in replace – kağan hazal koçdemir Sep 14 '21 at 19:56

Hans Musgrave · Answer 2 · 2018-01-04T05:33:21.203

1

If you don't care about item order, and assuming the data type of everything in column y is a string, you can use the following snippet:

df['y'] = df['y'].apply(lambda s: ', '.join(set(s.split(', '))))

The set() conversion is what removes duplicates. Note that sets are unordered in every version of Python, so the order of the items in the result is arbitrary. It is dicts, not sets, that preserve insertion order (an implementation detail in CPython 3.6, guaranteed by the language since 3.7).
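If the order of first appearance matters, as it does in the expected output in the question, one option is `dict.fromkeys` in place of `set`. The sketch below rebuilds the question's sample data; the helper name `dedupe_keep_order` is made up for illustration, and it assumes every cell is a string separated by `', '`.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "x": [1, 5, 8, 5, 7],
    "y": [
        "ec, us, us, gbr, lst",
        "ec, us, us, us, us, ec, ec, ec, ec",
        "ec, us, us, gbr, lst, lst, lst, lst, gbr",
        "ec, ec, ec, us, us, ir, us, ec, ir, ec, ec",
        "chn, chn, chn, ec, ec, us, us, gbr, lst",
    ],
})

def dedupe_keep_order(cell):
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7)
    return ", ".join(dict.fromkeys(cell.split(", ")))

df["y"] = df["y"].apply(dedupe_keep_order)
print(df["y"].tolist())
# ['ec, us, gbr, lst', 'ec, us', 'ec, us, gbr, lst', 'ec, us, ir', 'chn, ec, us, gbr, lst']
```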

edited Jan 04 '18 at 05:33

answered Jan 04 '18 at 04:36

Hans Musgrave

That call to `list` isn't needed. – Turn Jan 04 '18 at 04:41
I forgot to include that all the [y] column is like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it? – PAstudilloE Jan 04 '18 at 04:42
Even in Python 3.10, `set`s are [documented as unordered collections](https://docs.python.org/3.10/tutorial/datastructures.html#sets), so they should not be used if the order in which items are inserted or enumerated is important to a program. – Peter O. Jan 02 '22 at 17:52

score 1 · Answer 3 · answered Jan 04 '18 at 04:37

Try this:

df['y'] = df['y'].apply(lambda x: ', '.join(sorted(set(x.split(', ')))))

koPytok

it works perfect!... but I forgot to include that all the [y] column is like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it? – PAstudilloE Jan 04 '18 at 04:43

score 0 · Answer 4 · answered Jan 04 '18 at 04:44

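Use the apply method on the dataframe.

# change this function according to your needs
def dedup(row):
    # split the comma-separated string into items first; calling set() on
    # the raw string would deduplicate individual characters instead
    return list(set(row.y.split(', ')))

# each cell of the new column holds a list of unique items (order not guaranteed)
df['deduped'] = df.apply(dedup, axis=1)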

srj
