10

I have a list of tuples, each tuple of which contains one string and two integers. The list looks like this:

x = [('a',1,2), ('b',3,4), ('x',5,6), ('a',2,1)]

The list contains thousands of such tuples. Now if I want to get unique combinations, I can do the frozenset on my list as follows:

y = set(map(frozenset, x))

This gives me the following result:

{frozenset({'a', 2, 1}), frozenset({'x', 5, 6}), frozenset({3, 'b', 4})}

I know that set is an unordered data structure and this is normal case but I want to preserve the order of the elements here so that I can thereafter insert the elements in a pandas dataframe. The dataframe will look like this:

 Name  Marks1  Marks2
0    a       1       2
1    b       3       4
2    x       5       6
enterML
  • 2,110
  • 4
  • 26
  • 38
  • Do you really have situations where (a, 1, 2) is presented as (2, a, 1) and they should be considered duplicates? As it would seem that just using `pd.Series(x).unique()` is what you're after... – Jon Clements Aug 29 '17 at 10:15
  • Otherwise... is (a, 1, 2) definitely the same as (a, 2, 1) - if so - which one gets preserved (a, 1, 2) or (a, 2, 1)? – Jon Clements Aug 29 '17 at 10:16
  • (a,1,2) and (a,2,1) are treated as same and only the first occurence should be preserved i.e. (a,1,2) – enterML Aug 29 '17 at 10:17
  • Youch - so depending on the input of your data your output is going to be different... – Jon Clements Aug 29 '17 at 10:21

4 Answers4

6

Instead of operating on the set of frozensets directly you could use that only as a helper data-structure - like in the unique_everseen recipe in the itertools section (copied verbatim):

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

Basically this would solve the issue when you use key=frozenset:

>>> x = [('a',1,2), ('b',3,4), ('x',5,6), ('a',2,1)]

>>> list(unique_everseen(x, key=frozenset))
[('a', 1, 2), ('b', 3, 4), ('x', 5, 6)]

This returns the elements as-is and it also maintains the relative order between the elements.

MSeifert
  • 145,886
  • 38
  • 333
  • 352
  • 3
    Of course, if `x` is a `pd.Series` object you can take advantage of that and use `x[x.apply(frozenset).drop_duplicates().index]` to take advantage of all the pandas goodness... – Jon Clements Aug 29 '17 at 10:31
  • Even though that's much shorter - it also would be slower (depending on the length of `x` either much slower or just a bit slower). At least if I remember correctly. – MSeifert Aug 29 '17 at 10:34
  • I get using the itertools recipe 37.3ns per loop and 12us per loop using the apply/drop_duplicates/index approach. So yeah - only a slightly difference :) – Jon Clements Aug 29 '17 at 10:40
  • Yes, I also had a ~300x improvement when using the recipe instead of pandas indexing for the short lists. For lists containing several ten-thousand tuples it's only 1.5-2 times faster. – MSeifert Aug 29 '17 at 10:42
  • That sounds more ball park from what I recall... Anyway - just throwing it out there 'cos if you're doing more than a straight dedupe at the same time, (and you've already got a series) - then keeping it in the pandas eco-system can sometimes be worth the cost. – Jon Clements Aug 29 '17 at 10:44
  • 1
    Bah... not much luck with `pd.DataFrame(((*row, frozenset(row)) for row in x), columns=['M1','M2','M3','K']).drop_duplicates(subset=['K'])` either - although it's not that much slower and you end up with the dataframe wanted... think I'll just give up now :) – Jon Clements Aug 29 '17 at 10:51
  • 2
    Could be useful as another answer though - given that the question is tagged with pandas :) Like you said: speed is nice but sometimes it's nicer to just use existing functions instead of rolling own functions. – MSeifert Aug 29 '17 at 10:52
1

No ordering with frozensets. You can instead create sorted tuples to check for the existence of an item, adding the original if the tuple does not exist in the set:

y = set()
lst = []
for i in x:
    t = tuple(sorted(i, key=str)
    if t not in y:
         y.add(t)
         lst.append(i)
print(lst)
# [('a', 1, 2), ('b', 3, 4), ('x', 5, 6)]

The first entry gets preserved.

Moses Koledoye
  • 77,341
  • 8
  • 133
  • 139
0

There are some quite useful functions in NumPy which can help you to solve this problem.

import numpy as np
chrs, indices = np.unique(list(map(lambda x:x[0], x)), return_index=True)
chrs, indices
>> (array(['a', 'b', 'x'], 
   dtype='<U1'), array([0, 1, 2]))
[x[indices[i]] for i in range(indices.size)]
>> [('a', 1, 2), ('b', 3, 4), ('x', 5, 6)]
misakawa
  • 41
  • 1
  • 6
0

You can do it by simple using the zip to maintain the order in the frozenset. Give this a try pls.

l = ['col1','col2','col3','col4']
>>> frozenset(l)
--> frozenset({'col2', 'col4', 'col3', 'col1'})
>>> frozenset(zip(*zip(l)))
--> frozenset({('col1', 'col2', 'col3', 'col4')})

Taking an example from the question asked:

>>> x = [('a',1,2), ('b',3,4), ('x',5,6), ('a',2,1)]
>>> frozenset(zip(*zip(x)))
--> frozenset({(('a', 1, 2), ('b', 3, 4), ('x', 5, 6), ('a', 2, 1))})
StevenzNPaul
  • 188
  • 12