
Recently in an interview I was asked to write Python code to remove all duplicate entries from a list.

For example :

Input List = [1,2,4,5,2,3,1]
Expected Output List = [4,5,3]

In the above example, the 1s and 2s appear more than once, so they should be removed. And order preservation is important. This was the question asked.

Also, he did not want me to use built-in functions like set(), unique(), etc. He was testing my algorithm and data-structure skills, I guess. He had made this clear at the start.

I thought of 2 approaches on the spot: 1) sorting (O(n log n) complexity), 2) a hash table (faster than sorting).
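The sorting approach I had in mind could be sketched roughly like this (just a sketch; it leans on the stdlib sorted() and bisect for brevity, and the function name is my own):

```python
from bisect import bisect_left, bisect_right

# Sketch of the sorting approach: sort a copy (O(n log n)), then keep x
# only if it occurs exactly once in the sorted copy, checked with a
# binary search (O(log n) per element). Original order is preserved
# because we filter the original list, not the sorted copy.
def rmvAllDuplicateBySorting(arr):
    s = sorted(arr)
    return [x for x in arr if bisect_right(s, x) - bisect_left(s, x) == 1]
```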

HashTable Approach:

arr = [1,2,4,5,2,3,1]

# function: build a hash table with key = arr[i] and value = occurrence count
def dataCountTable(arr):
    countTable = {}
    i = 0
    while i < len(arr):
        if arr[i] in countTable:
            countTable[arr[i]] += 1
        else:
            countTable[arr[i]] = 1
        i += 1
    return countTable

# function: remove duplicates using the arr and the hash table
def rmvAllDuplicate(arr, countTable):
    outList = list()
    i = 0
    while i < len(arr):
        if countTable[arr[i]] == 1:
            outList.append(arr[i])
        i += 1
    return outList

print(rmvAllDuplicate(arr, dataCountTable(arr)))

The interviewer did not seem impressed with this answer, and that kept me searching for a better approach. I couldn't find one.

If someone could help me improve my solution or suggest a new and better solution, that would be nice!

Thanks!

sudhishkr
  • just use a set, if you dont bother about order – Hackaholic Nov 08 '14 at 00:46
  • making an edit - order preservation was part of the problem. And he did not want me to use inbuilt functions like unique() or something similar – sudhishkr Nov 08 '14 at 00:47
  • http://stackoverflow.com/questions/7961363/python-removing-duplicates-in-lists – mlwn Nov 08 '14 at 00:48
  • Also see [fastest algorithm for removing duplicates](http://stackoverflow.com/questions/89178/in-python-what-is-the-fastest-algorithm-for-removing-duplicates-from-a-list-so). It's not the same problem as yours, but it's a good basis for an argument that a hash-based approach will be fastest in practice. – abarnert Nov 08 '14 at 01:51

7 Answers

4

I would use collections.Counter for this (or implement one on my own):

from collections import Counter
input_list = [1,2,4,5,2,3,1]
# expected_output_list = {4,5,3}

# make Counter object for list elements
# and pick up to list only those values for which count is 1 
singles = {x for x, count in Counter(input_list).items() if count == 1}
# filter your list to get only elements that were not duplicates
result = [x for x in input_list if x in singles]

Or as @falsetru pointed out, you can just:

result = [x for x, count in Counter(input_list).items() if count == 1]

But you have no guarantee of preserving the order of your list in this case (h/t @DSM).

This has linear time complexity.

m.wasowski
  • The last list comprehension is redundant: `singles = [x for x, count in Counter(input_list).items() if count == 1]` – falsetru Nov 08 '14 at 00:54
  • It's not redundant: *And order preservation is important.* – DSM Nov 08 '14 at 00:57
  • Indeed, for list you can shorten it; I had removing duplicate values from dictionary in my head ;) – m.wasowski Nov 08 '14 at 00:57
  • This is a much more concise/readable/faster-by-a-smallish-constant version of what the OP has added as the "hash table approach" in his edit… – abarnert Nov 08 '14 at 01:00
3

You can do it in one line with list comprehension:

in_list = [1,2,4,5,2,3,1]
out_list = [num for num in in_list if in_list.count(num) == 1]
# result: [4,5,3]
101
3

Although your hash-table implementation could be made more concise, readable, and idiomatic, with a small boost in speed, I suspect that isn't what your interviewer was disappointed in.

More likely, he pushed you for a better solution in hopes that you would come up with an argument why your solution was effectively optimal, but instead, you searched for a while and gave up.

So, there are multiple things to prove here:

  1. Any solution to this problem has to be O(N) time.
  2. Your solution is (amortized, average-and-almost-always) O(N) time.
  3. The multiplier in your solution's time complexity is reasonable.
  4. Any solution to this problem that's O(N) time has to be O(M) space (where M is the number of distinct values).
  5. Your solution is O(M) space.
  6. The multiplier in your solution's space complexity is reasonable.

Even for the easy ones, you're not going to come up with an actual rigorous proof in an interview. And for some of them, you may not even be able to make a compelling case, but raising possible exceptions and acknowledging where you're waving your hands may be sufficient. For example:

  • Python's dict (and set) has O(N) worst-case time; it's only O(1) amortized average-case. Is there anything about your data that would make it likely to be worse than O(1)? Probably not, but… what if this were part of a server that someone wanted to DoS, and they could send any input data they wanted?
  • All of the values he gave you were small integers. Is that guaranteed to be true? In that case, don't use a dict for your counts, just use `[0] * P`, where P is the max value plus one. Then it's O(P) space, which sounds worse than O(M), except that the multiplier will be a lot smaller: a list takes about 1/3rd the space (just values, rather than hashes, keys, and values), so as long as P isn't much bigger than 3M it's a win. And it's also probably a win on speed, because there's no need to keep hashing values. And you can do even better with array.array.
  • A Python hash table is overkill for storing both sets and dicts with small counts. Could a custom hash table cut things significantly, or not enough to be worth it?
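For instance, the small-integers idea from the second bullet might look something like this (a sketch with a hypothetical helper name; it assumes all values are non-negative integers):

```python
from array import array

# Counts indexed directly by value: O(P) space where P = max value + 1,
# but no hashing at all. Only valid for non-negative integer inputs.
def remove_dups_small_ints(arr):
    counts = array('L', [0] * (max(arr) + 1))
    for x in arr:
        counts[x] += 1
    return [x for x in arr if counts[x] == 1]
```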
abarnert
  • By the way, personally, I think the "nothing but small integers" issue is very important, and I'd love to hear you ask me that right off the bat. But I've talked to people who do interviews at places like Google and Microsoft, and they all tell me they'd just answer, "No, you can't assume anything about the data" and move on. Still, I don't care about you getting a job at Google or Microsoft as much as I care about you writing good code whoever hires you, so I'm still going to recommend you ask questions like that. – abarnert Nov 08 '14 at 01:45
  • This is a really nice explanation, thanks! - i will take this as the answer. – sudhishkr Nov 08 '14 at 09:57
2

I'm guessing if you're not allowed to use builtin functions, you're also not allowed to use stdlib classes. Otherwise, definitely use m.wasowski's answer.

But can you do the same thing yourself?

Sure, a Counter is just a fancy dict. You could implement your own Counter, or just do the same work explicitly:

input_list = [1,2,4,5,2,3,1]
counts = {}
for value in input_list:
    counts.setdefault(value, 0)
    counts[value] += 1

And now it's just the same as the rest of his code:

singles = {x for x, count in counts.items() if count == 1}
result = [x for x in input_list if x in singles]

This is effectively the same thing you're doing in your "hash table approach". But it is more concise, more readable, more idiomatic, and faster by a small but nonzero constant, so it may still have impressed the interviewer more. (And all of those things are even more true of m.wasowski's version, of course.)
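And if you wanted a reusable Counter of your own rather than the explicit loop, a minimal dict subclass is enough (just a sketch, not the real collections.Counter):

```python
# Minimal hand-rolled stand-in for collections.Counter:
# a dict whose missing keys read as 0, so `counts[x] += 1` just works.
class MyCounter(dict):
    def __missing__(self, key):
        return 0

input_list = [1, 2, 4, 5, 2, 3, 1]
counts = MyCounter()
for value in input_list:
    counts[value] += 1

singles = {x for x, count in counts.items() if count == 1}
result = [x for x in input_list if x in singles]
# result is [4, 5, 3]
```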

abarnert
0

Try this simple approach:

l=[1,2,4,5,2,3,1]
[x for x in l if l.count(x)==1 ]

It will keep only the items that appear exactly once.

Hackaholic
-1

Iterate over the list and keep a marker for each element you have seen. If you encounter the same element again, the marker will already be set, and you don't add the element to the output list anymore. This gives a linear-time algorithm. I guess that's why the interviewer was not happy with your solution. Hashing does practically the same thing, but you are creating a large overhead by maintaining the hash table.

def f(seq):
    seen = {}
    result = []
    for item in seq:
        if item in seen: continue
        seen[item] = 1
        result.append(item)
    return result
user2276094
  • This just uniquifies the list. That's not the problem. You'll give the output `[1, 2, 4, 5, 3]`, not `[4, 5, 3]`. (Also, using a dict to simulate a set instead of just using a set probably isn't going to impress an interviewer very much.) – abarnert Nov 08 '14 at 01:06
  • Ah snap, didn't read the question properly. But you can use the same idea to get the duplicates. Also I thought seen was a set lol – user2276094 Nov 08 '14 at 01:28
  • Yeah, it's a bit annoying that `{}` isn't an empty set in Python… but if it _were_ a set, `seen[item] = 1` would be wrong; it's just `seen.add(item)`. – abarnert Nov 08 '14 at 01:37
-1

A set in Python is a collection without duplicated items. You can take a list and remove the duplicated items by turning it into a set and then back into a list:

l = [1,2,3,2,4,3]
l = list(set(l))
print(l)
# output: [1, 2, 3, 4]
Amir Rachum
Vikas