I have a huge Python list, about 100 MB in size, containing strings and integers. Some of the strings appear two or three times. I have tried to remove the duplicates with this code:

from collections import OrderedDict

duplicates = [.......large size list of 100 MB....]

remove = OrderedDict.fromkeys(duplicates).keys()

print remove

I have done this with small lists and it works well, but with this large list it has been running for a whole day and is still not finished. Any suggestions on how this can be done in minutes, or at least in a few hours? I have tried installing CUDA on Ubuntu to work it out, but I keep getting errors: see here

lobjc
  • Do you mean you want to remove every copy of a duplicated item, or only the *other* copies? Say I give you `[a,b,a,c,a]`. Do you want `[a,b,c]` or `[b,c]`? – Willem Van Onsem Jan 31 '17 at 14:37
  • This sounds like a somewhat strange operation to want to do on a list. Are you sure you want to use a `list`? If you use a different data structure (such as a set), this kind of operation would be trivial. – Kevin Jan 31 '17 at 14:40
  • @willem, if I have `['a','b','a','c']`, I want `['a','b','c']` – lobjc Jan 31 '17 at 15:33
  • @kevin, I tried a set and it took just as long – lobjc Jan 31 '17 at 15:34
  • Why did you use a list in the first place? Is the order important? Instead of removing duplicates from a list, could you consider not adding the duplicates in the first place? Where do the data come from? Could you provide more information on what you're actually doing? – Vincent Savard Jan 31 '17 at 17:11
  • Do you need to preserve the order? If not, just use `remove = set(duplicates)`. If order is important, as Raymond Hettinger says, there is no faster way than what you're doing: http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-whilst-preserving-order/39835527#39835527 – Chris_Rands Jan 31 '17 at 17:14

1 Answer

Not sure if this is efficient enough, but one simple way to solve it is to convert your list into a set.

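def unique(objects):
    # set() drops duplicates; sorted() already returns a list, so list() is not needed.
    # Note: the original order of the items is lost.
    return sorted(set(objects))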
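
If the original order has to be kept (the comments suggest `['a','b','a','c']` should become `['a','b','c']`), the sort is not needed. On Python 3, `sorted` would also raise a `TypeError` for a list that mixes strings and integers. Below is a minimal sketch of an order-preserving variant, assuming every element is hashable; `unique_in_order` is a hypothetical helper name, not something from the question.

```python
def unique_in_order(items):
    """Return items without duplicates, keeping the first occurrence of each."""
    seen = set()      # values already emitted
    result = []
    for item in items:
        if item not in seen:   # average O(1) membership test on a set
            seen.add(item)
            result.append(item)
    return result


print(unique_in_order(['a', 'b', 'a', 'c', 1, 'c', 1]))
# prints ['a', 'b', 'c', 1]
```

This makes a single pass over the list, so the running time should grow roughly linearly with the number of items.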
lelabo_m