
I have gigantic lists of objects with many duplicates (I'm talking thousands of lists with thousands of objects each, adding up to about 10 million individual objects, already without duplicates).

I need to go through them and remove all the duplicates inside each list (no need to compare between lists, only inside each one).

I can, of course, go through the lists and compare using any of the dedupe algorithms that have been posted many times, but I would guess this would take forever.

I thought I could create an object with a crafted __hash__ method and wrap each list in list(set(...)) to remove the duplicates, but first, I don't know if this would work, and second, I would still have to loop through the lists to convert the elements to the new object.
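For illustration, a minimal sketch of that idea, with hypothetical fields a, b and c standing in for the real ones:

class Record:
    """Hashable wrapper so instances can go into a set() (hypothetical fields)."""
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

    def __eq__(self, other):
        return (self.a, self.b, self.c) == (other.a, other.b, other.c)

    def __hash__(self):
        return hash((self.a, self.b, self.c))

records = [Record('aa', 2, False), Record('aa', 2, False), Record('bb', 4, True)]
unique_records = list(set(records))  # the duplicate Record('aa', 2, False) is dropped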

I know Python is not the best tool for what I am trying to achieve, but in this case it will have to be done in Python. I wonder what would be the best way to achieve this with the best possible performance.

Edit: for clarification: I have about 2k lists of objects, with about 5k objects inside each one (rough estimate). The duplicated objects are copies, not references to the same memory location. The lists are basically converted JSON arrays (so the objects are dicts).


Edit 2: I'm sorry for not being clear, I will rephrase.

This is for a Django data migration, although my question only applies to the data 'formatting' and not the framework itself or the database insertion. I inserted a whole bunch of data as JSON into a table for later analysis. Now I need to normalize it and save it correctly. I created new tables and need to migrate the data.

So when I retrieve the data from the db I have about 2000 JSON arrays. Applying json.loads(arr) (as per the documentation) I get 2000 lists of objects (dicts). Each dict has only strings, numbers and booleans as values for each key, no nested objects/arrays, so something like this:

[
  {
    'a': 'aa',
    'b': 2,
    'c': False,
    'date': <date_as_long>  # ex: 1471688210
  },
  {
    'a': 'bb',
    'b': 4,
    'c': True,
    'date': <date_as_long>  # ex: 1471688210
  }
]

What I need is to run through every list and remove duplicates. Within a list, an element is considered a duplicate if all the fields except the date match (this wasn't in the original question, as I had not predicted it). If elements match across different lists, they are not considered duplicates.
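As a rough sketch of that rule, each dict could be keyed on everything except the date (hypothetical field names; note that a dict comprehension keeps the last occurrence of each key rather than the first):

# my_list is one of the inner lists; 'date' is deliberately left out of the key
deduped = list({(d['a'], d['b'], d['c']): d for d in my_list}.values())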

After better analysis of the contents, I found out I have close to 2 million individual records (not 10 million as I said previously). The performance problems I face are because each dict needs to go through some data formatting (converting dates, for example) and be 'wrapped' in the model object for database insertion: ModelName(a='aaa', b=2, c=True, date=1471688210).

The insertion into the database itself is done with bulk_create.
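For reference, bulk_create accepts a batch_size argument, which keeps each INSERT statement bounded when pushing this many rows; a minimal sketch, assuming the model is called MyModel and the instances are already built:

# Insert all formatted, de-duplicated instances in batches (the batch size is a guess)
MyModel.objects.bulk_create(formatted_data, batch_size=1000)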

NOTE: I'm sorry for the lack of clarification on the original question. The more I dug into this, the more I learned about what had to be done and how to handle the data.


I accepted @tuergeist's answer because it pointed me to what I needed, even with the poor details on my part.

Given that dicts are not hashable, so I can't add them to a set(), my solution was to create a set() of tuples from the relevant fields and check for duplicates against it. This avoids the extra iteration that keeping the seen items in a list would require.

So it was something like this:

data = [lots of lists of dicts]
formatted_data = []
duplicates = set()

for my_list in data:
    for element in my_list:
        a = element['a']
        b = convert_whatever(element['b'])
        c = element['c']

        # Only the fields that count for duplicate checking go here (not the date)
        d = (a, b, c)

        if d not in duplicates:
            duplicates.add(d)
            normalized_data = {
                'a': a,
                'b': b,
                'c': c,
                'date': element['date']
            }
            formatted_data.append(MyModel(**normalized_data))

    # Duplicates only matter within a single list, so reset between lists
    duplicates.clear()

After this, for better memory management, I restructured it (my finished code uses generators; this simplified example uses comprehensions):

data = [lots of lists of dicts]
formatted_data = []
duplicates = set()

def format_element(el):
    a = el['a']
    b = convert_whatever(el['b'])
    c = el['c']

    # Only the fields that matter for duplicate detection (the date is excluded)
    d = (a, b, c)

    if d not in duplicates:
        duplicates.add(d)
        normalized_data = {
            'a': a,
            'b': b,
            'c': c,
            'date': el['date']
        }
        formatted_data.append(MyModel(**normalized_data))

def iter_list(l):
    [format_element(x) for x in l]
    duplicates.clear()

[iter_list(my_list) for my_list in data]

Working code here: http://codepad.org/frHJQaLu

NOTE: My finished code is a little different (and in a functional style) from this one. This serves only as an example of how I solved the problem.
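For completeness, a rough sketch of how a generator-based version might look (hypothetical; bulk_create may materialize the iterable internally anyway):

def iter_unique(data):
    """Yield one MyModel per unique (a, b, c) combination within each inner list."""
    for my_list in data:
        seen = set()
        for el in my_list:
            key = (el['a'], convert_whatever(el['b']), el['c'])
            if key not in seen:
                seen.add(key)
                yield MyModel(a=key[0], b=key[1], c=key[2], date=el['date'])

MyModel.objects.bulk_create(iter_unique(data), batch_size=1000)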


Edit 3: For the database insertion I used bulk_create. In the end it took 1 minute to format everything correctly (1.5 million unique entries, 225k duplicates) and 2 minutes to insert everything into the database.

Thank you all!

fgarci03
  • Would you make your question clearer. What do you mean by duplicates? Are they referring to the same memory location, or a copy? You said list inside list, so they are tree like. I don't think the time complexity will be lower than O(n^2). Are you looking for an algorithm or looking for any existing function for you to do so? – Xiaojun Chen Aug 19 '16 at 07:46
  • Sorry, just edited the details.. it's not tree like, it's individual lists of objects.. each list has lots of duplicates and I need to clear them. Well if Python offers a function to it, better. But I just want to solve the problem in the best way possible :) – fgarci03 Aug 19 '16 at 07:52
  • What type are the elements of these lists? – ElmoVanKielmo Aug 19 '16 at 07:54
  • I don't think the size of your nested lists makes this infeasible with normal python methods and looping. If order does not matter, use `set()`, or else see this: http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order. – Chris_Rands Aug 19 '16 at 07:55
  • I agree with @Chris_Rands - the crucial question is if elements of lists are hashable or implement `__cmp__` or `__eq__`? – ElmoVanKielmo Aug 19 '16 at 07:56
  • "I would guess this would take me forever", "I don't know if this would work" Can I suggest you try one of these methods, if it works - great. If not you can post your code here. – James K Aug 19 '16 at 07:57
  • @ElmoVanKielmo They *surely* don't implement `__cmp__` since that is gone in python3. They may implement `__lt__`,`__le__`, `__gt__`, or `__ge__`. In any case: the fundamental question is: does order matter? If it doesn't matter than using `list(set(..))` will be fine or use `[key for key,_ in groupby(sorted(seq))]`. – Bakuriu Aug 19 '16 at 07:58
  • Yeah, sorry, I missed python-3.x tag. But other comparison operators are there as you mentioned. – ElmoVanKielmo Aug 19 '16 at 08:01
  • By the way: I've tried the `list(set(..))` solution on 2k lists of 5k random integers taken from `0-100`. the timeit module reports 142 *milliseconds*. If the objects are not simple integers but more complex objects this will take more time but still I doubt that it will "take forever". – Bakuriu Aug 19 '16 at 08:04
  • PLEASE provide an example of your "lists" as you're writing about dict, arrays and whatsoever. – tuergeist Aug 19 '16 at 09:43
  • Sorry for the lack of detail, question edited – fgarci03 Aug 20 '16 at 11:33

4 Answers


A fast, not order-preserving solution (for hashable items) is:

def unify(seq):
    # Not order preserving
    return list(set(seq))

Complete Edit

I assume that you have dicts inside a list, and that you have many lists. The solution to remove duplicates from a single list is:

def remove_dupes(mylist):
    newlist = []
    for e in mylist:
        if e not in newlist:  # linear scan, so this also works for unhashable items like dicts
            newlist.append(e)
    return newlist

A list here contains dicts like the following (but randomly generated):

{"firstName":"John", "lastName":"Doe"}, 
{"firstName":"Anna", "lastName":"Smith"}, 
{"firstName":"Peter","lastName":"Jones"}

Running this, it took 8 s for 2000 dicts on my MacBook (2.4 GHz i5).

Complete code: http://pastebin.com/NSKuuxUe
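If all the values in the dicts are hashable (strings, numbers, booleans), a set of frozen item tuples avoids the linear membership scan above; a rough sketch:

def remove_dupes_fast(mylist):
    # Assumes every value in the dicts is hashable
    seen = set()
    newlist = []
    for d in mylist:
        key = frozenset(d.items())
        if key not in seen:
            seen.add(key)
            newlist.append(d)
    return newlist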

tuergeist
  • Dont the elements inside the list need to be hashable? Basically this is a JSON array of objects I converted to a dict – fgarci03 Aug 19 '16 at 08:18
  • You may provide example data instead of downvoting – tuergeist Aug 19 '16 at 08:22
  • I didn't downvote you, I wasn't even absolutely sure of what I commented, hence why I asked :( (unless I clicked it without noticing, but here it doesn't show up that I did) – fgarci03 Aug 19 '16 at 08:28
  • I also didn't downvote you but: 1. Make it for Python 3 as the question is tagged with Python 3. 2. It's not going to work with dict instances as list elements. – ElmoVanKielmo Aug 19 '16 at 08:57
  • On Py3 you're right. The "dict" edit is newer than my answer! And still, there is no example given yet – tuergeist Aug 19 '16 at 09:43

I'd suggest keeping a sorted list (if possible), so you can compare items more precisely (like with a dictionary, I mean). A hashed (or not) list can accomplish that.

If you have the ability to manage the "add and delete" operations on your lists, even better! Re-sort the items each time you add/delete. (IMO good if you have a hashed list; forget it if you have a linked list.)

Complexity will of course depend on your structure (FIFO/FILO list, linked list, hash, ...).
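For instance, if the items can be sorted, duplicates end up adjacent and can be dropped in a single pass; a sketch using itertools.groupby:

from itertools import groupby

def dedupe_sorted(items, key=None):
    """Drop adjacent duplicates from an already-sorted iterable."""
    return [k for k, _ in groupby(items, key=key)]

print(dedupe_sorted([1, 1, 2, 3, 3]))  # [1, 2, 3]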


Here is a solution for sorted lists:

class Solution:
    def removeDuplicates(self, nums):
        """
        :type nums: List[int]
        :rtype: int
        """
        if len(nums) == 0:
            return 0
        j = 0
        for i in range(len(nums)):
            if nums[i] != nums[j]:
                j = j + 1
                nums[j] = nums[i]

        return j + 1
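A quick usage sketch, assuming the input list is already sorted:

nums = [1, 1, 2, 3, 3, 3]
k = Solution().removeDuplicates(nums)
print(nums[:k])  # [1, 2, 3]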

For compound objects (like lists in a list), the code below should suffice:

def unique(a):
    i = 0
    while i < len(a):
        # Compare a[i] against everything after it and drop the matches
        p = a[i+1:]
        j = 0
        while j < len(p):
            if p[j] != a[i]:
                j = j + 1
            else:
                p.remove(p[j])
        a = a[:i+1] + p

        i = i + 1
    return a
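A quick usage sketch with made-up data:

print(unique([[1, 2], [3], [1, 2], {'a': 1}, {'a': 1}]))
# [[1, 2], [3], {'a': 1}]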