Searching for duplicates and remove them

Question

Sometimes I have a string like this:

string = "Hett, Agva,"

And sometimes it will have duplicates in it:

string = "Hett, Agva, Delf, Agva, Hett,"

How can I check whether my string has duplicates and, if it does, remove them?

UPDATE.

So in the second string I need to remove Agva, and Hett, because each of them appears twice in the string.

so the `'Hett'` that appears twice does not bother you.. You have to work on your definition a bit. If it is just `'Agva'` you might as well rewrite the string. — Ma0, Aug 29 '18 at 09:49
The OP wants **all** duplicates to be removed then be it `Hett` or `Agva` or `blah` — Sheldore, Aug 29 '18 at 09:52
Possible duplicate of [How can I remove duplicate words in a string with Python?](https://stackoverflow.com/questions/7794208/how-can-i-remove-duplicate-words-in-a-string-with-python) — Ankur Sinha, Aug 29 '18 at 10:09

Joe Iddon · Accepted Answer · 2018-08-29T09:58:46.620

Iterate over the parts (words), adding each part to a set of seen parts and to a list of parts if it is not already in that set. Finally, reconstruct the string:

seen = set()
parts = []
for part in string.split(','):
    if part.strip() not in seen:
        seen.add(part.strip())
        parts.append(part)

no_dups = ','.join(parts)

(Note that I had to add some calls to .strip(), as some of the words start with a space, which .strip() removes before the comparison.)

which gives:

'Hett, Agva, Delf,'

Why use a set?

Querying whether an element is in a set is O(1) in the average case, since elements are stored by hash, which makes lookup constant time. On the other hand, lookup in a list is O(n), as Python must iterate over the list until the element is found. This means a set is much more efficient for this task: for each new word you can check almost instantly whether you have seen it before, whereas with a list of seen elements you would have to iterate over it, which takes much longer for a large list.
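As a shorter variant of the loop above, and assuming Python 3.7 or later (where dicts are guaranteed to keep insertion order), dictionary keys give the same hashed membership check while remembering first-seen order. This is only a minimal sketch and not part of the original answer: the function name `remove_dups` is made up for illustration, and unlike the loop above it normalises the spacing and drops the trailing comma.

```python
def remove_dups(string):
    """Return the comma-separated words of `string` with later repeats dropped."""
    # Strip each piece and skip empty ones (e.g. the piece after a trailing comma).
    words = [part.strip() for part in string.split(',') if part.strip()]
    # dict.fromkeys keeps only the first occurrence of each key, in insertion order.
    return ', '.join(dict.fromkeys(words))


print(remove_dups("Hett, Agva, Delf, Agva, Hett,"))  # Hett, Agva, Delf
```

If the trailing comma matters for your format, it can be appended again after the join.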

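Oh, and to just check whether there are duplicates, compare the length of the list of parts with the length of the set of that list (a set removes the duplicates but loses the order). The parts are stripped first so that 'Hett' and ' Hett' count as the same word.

I.e.

def has_dups(string):
    # Strip each part and skip empty ones, such as the piece after a trailing comma.
    parts = [part.strip() for part in string.split(',') if part.strip()]
    return len(parts) != len(set(parts))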

which works as expected:

>>> has_dups('Hett, Agva,')
False
>>> has_dups('Hett, Agva, Delf, Agva, Hett,')
True
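If it is also useful to know which words are repeated, rather than only whether any are, `collections.Counter` from the standard library can count them. The sketch below is an illustration under that assumption; the name `find_dups` is hypothetical and not from the original answer.

```python
from collections import Counter


def find_dups(string):
    """Return the words that occur more than once, in first-seen order."""
    # Strip each piece and skip empty ones (e.g. the piece after a trailing comma).
    words = [part.strip() for part in string.split(',') if part.strip()]
    counts = Counter(words)
    # Counter is a dict subclass, so on Python 3.7+ it keeps first-seen order.
    return [word for word, n in counts.items() if n > 1]


print(find_dups("Hett, Agva,"))                    # []
print(find_dups("Hett, Agva, Delf, Agva, Hett,"))  # ['Hett', 'Agva']
```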

@Chaban33 My bad, needed to strip the leading spaces... Now it works :) — Joe Iddon, Aug 29 '18 at 09:52

haccks · Answer 2 · 2018-08-29T10:22:36.747

1

If the order of the words is important, you can make a list of the words in the string and then iterate over that list to build a new list of unique words.

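string = "Hett, Agva, Delf, Agva, Hett,"
# Split on whitespace, so each word keeps its trailing comma.
words_list = string.split()

unique_words = []
for w in words_list:
    # Keep only the first occurrence of each word.
    if w not in unique_words:
        unique_words.append(w)
new_string = ' '.join(unique_words)
print(new_string)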

Output:

Hett, Agva, Delf,

edited Aug 29 '18 at 10:22

answered Aug 29 '18 at 09:46


Once you make a list of *only* words *without* comma, just doing `set(list_name)` would do the job. For ex. `x = ['a', 'b', 'a']` and `set(x)` gives `{'a', 'b'}`. That can be further converted to a list – Sheldore Aug 29 '18 at 09:48
@Bazingaa No, because then you lose the order. My logic is necessary. – Joe Iddon Aug 29 '18 at 09:53
@Bazingaa; OP has not specified if he wants to remove ','. So I left it as it is. – haccks Aug 29 '18 at 09:55
This isn't as efficient as using a set though, see my answer. – Joe Iddon Aug 29 '18 at 09:59
@JoeIddon: Well, unless the OP specifies that the order must be maintained, we don't know. – Sheldore Aug 29 '18 at 10:00
@Bazingaa I guess. – Joe Iddon Aug 29 '18 at 10:00

score 1 · Answer 3 · answered Aug 29 '18 at 09:53

You can use toolz.unique, or equivalently the unique_everseen recipe in the itertools docs, or equivalently @JoeIddon's explicit solution.

Here's the solution using 3rd party toolz:

from toolz import unique

x = "Hett, Agva, Delf, Agva, Hett,"

# Remove spaces, split on commas, keep first occurrences, and drop empty strings.
res = ', '.join(filter(None, unique(x.replace(' ', '').split(','))))

print(res)

Hett, Agva, Delf

I've removed whitespace and used filter to clean up the trailing comma; that cleanup may not be required.
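For readers who would rather not install toolz, the `unique_everseen` idea mentioned above can be sketched in plain Python. This is a simplified version written for illustration (it omits the optional `key` argument of the recipe in the itertools docs), so treat it as an approximation rather than the documented recipe itself.

```python
def unique_everseen(iterable):
    """Yield each element the first time it appears, preserving order."""
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element


x = "Hett, Agva, Delf, Agva, Hett,"
# Strip each piece so that 'Hett' and ' Hett' compare equal.
words = (part.strip() for part in x.split(','))
# filter(None, ...) drops the empty string left by the trailing comma.
res = ', '.join(filter(None, unique_everseen(words)))
print(res)  # Hett, Agva, Delf
```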

score 1 · Answer 4 · answered Aug 29 '18 at 09:54

If you will only receive strings in this format, then you can do the following:

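import numpy as np

# Strip whitespace and drop the empty piece left by the trailing comma.
string_words = [w.strip() for w in string.split(',') if w.strip()]
# np.unique returns the unique items in sorted order, so the original order is lost.
uniq_words = np.unique(string_words)

string = ", ".join(uniq_words)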

This code splits the string into a list of words, finds the unique items, and then merges them back into a string like before. Note that np.unique returns the items sorted, so the original order of the words is not kept.

Nimeshka Srimal · Answer 5 · 2020-03-06T05:36:19.343

0

Quick and easy approach (note that a set does not keep the original order of the words):

', '.join(
         set(
             filter( None, [ i.strip() for i in string.split(',') ] )
         )
     )

Hope it helps. Please feel free to ask if anything is not clear :)

edited Mar 06 '20 at 05:36

answered Aug 29 '18 at 10:53

