How to process each item in an ordered list with duplicates only once in Python?

Question

I have an ordered list of things to process that includes some duplicates and I only want to process the first occurrence. Presently, I'm doing it like this in Python v2.7:

seen = set()
for (value, fmt) in formats:
  if fmt not in seen:
    seen.add(fmt)
    process(value, fmt)

Is there anyway to simultaneously insert a new element into seen and detect whether or not it was already present? (This would avoid the repeated lookup of fmt in the set.)

seen = set()
for (value, fmt) in formats:
  # myInsert() would return true if item was not already present.
  if seen.myInsert(fmt):
    process(value, fmt)

Or, alternatively, can I somehow just filter my formats to exclude duplicate entries before looping?

unique_formats = removeDuplicates(formats, key=itemgetter(1))
for (value, fmt) in unique_formats:
  process(value, fmt)

Is the set lookup followed by `add()` really the overall performance bottleneck worth optimizing? — NPE, Apr 04 '13 at 19:46
Not in this case, but I still want to know how to do this for the rare case where it does matter. Plus, I'd like the code to be more succinct with only one function call instead of two. — WilliamKF, Apr 04 '13 at 19:52
I'd go for the first solution you provided but make the add function local. -> seen = set(), add = seen.add and call add(fmt). — SaCry, Apr 04 '13 at 20:03
@SaCry I don't quite understand what you are suggesting, could you please elaborate in an answer? — WilliamKF, Apr 04 '13 at 20:31
@AshwiniChaudhary I can't easily post a portion of the list due to the question being a simplified version and thus my data does not exactly match the question. Further, I'd prefer answers that are independent of my data. — WilliamKF, Apr 04 '13 at 20:31
@WilliamKF But how's your list sorted, is it sorted based on the first element of the tuple or the second? — Ashwini Chaudhary, Apr 04 '13 at 20:38
@AshwiniChaudhary It is sorted based on the first element of the tuple, the `value`. — WilliamKF, Apr 04 '13 at 20:48
@WilliamKF Yeah sure. It just means that lookups in python are expensive. Seeing that "add" will be used quite often you could simply use a local variable that holds append. This saves you as mentioned a lookup per loop. More on that -> http://www.selenic.com/blog/?p=656 — SaCry, Apr 04 '13 at 20:54

Roland Smith · Accepted Answer · 2013-04-04T19:51:25.550

You could take the length of the set before and after the add(). If it didn't change, the format was already in the set.

seen = set()
for (value, fmt) in formats:
    l1 = len(seen)
    seen.add(fmt)
    if l1 != len(seen):
         process(value, fmt)

Your question presumes that the in test is a costly operation. This turns out not to be the case. Using len() can take more time, although both are quite fast;

In [4]: seen = set(range(10000))

In [5]: %timeit 5995 in seen
10000000 loops, best of 3: 122 ns per loop

In [6]: %timeit len(seen)
1000000 loops, best of 3: 167 ns per loop

(measured with CPython 2.7.3 on a 2.5 GHz Core Quad Q9300)

Well both `add()` and `elem in seen` are `O(1)` operation, so this doesn't improve anything. Except the fact that you used some extra memory here. — Ashwini Chaudhary, Apr 04 '13 at 19:45

score 1 · Answer 2 · answered Apr 04 '13 at 19:51

I think your first approach is the best. Even the unique_everseen recipe from itertools recipes is using the same approach.

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in ifilterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

IMHO this is the right approach. It's an iteration logic, so you wrap it up as a generator, write `for something in looper(whatever):`, and get on with the day. — DSM, Apr 04 '13 at 19:55

score 0 · Answer 3 · answered Apr 04 '13 at 19:34

0

You have to use a set:

for (value, fmt) in set(formats):
   process(value, fmt)

answered Apr 04 '13 at 19:34

Ionut Hulub

6,180
5
26
55

Also this will take unique *pairs*. If several values have the same `fmt` this won't do the same thing. – Ismail Badawi Apr 04 '13 at 19:37
Also, the set(ness) should only apply to the fmt, not the value. – WilliamKF Apr 04 '13 at 19:37

score 0 · Answer 4 · answered Apr 04 '13 at 19:47

0

from ordereddict import OrderedDict
unique_formats = list(OrderedDict.fromkeys(format))
process(unique_formats)

This will preserve the order and remove duplicates

answered Apr 04 '13 at 19:47

cashmere

2,811
1
23
32

1

But this doesn't guarantees that it'll only store the key that appeared first. – Ashwini Chaudhary Apr 04 '13 at 19:49

score 0 · Answer 5 · answered Apr 04 '13 at 19:53

0

You can use itertools.groupby to group by the second element of the pair, and then only consider the first value.

>>> from itertools import imap, groupby
>>> from operator import itemgetter
>>> formats = [(1, 'xxx'), (2, 'xxx'), (3, 'yyy'), (4, 'yyy')]
>>> for fmt, value in imap(lambda (x, y): (x, next(y)[0]), groupby(formats, itemgetter(1)):
...    print('%s: %s', fmt, value)
...
xxx: 1
yyy: 3

answered Apr 04 '13 at 19:53

Ismail Badawi

36,054
7
85
97

This will have trouble with non-contiguous repeats. – DSM Apr 04 '13 at 19:57
What if `formats` is something like this : `[(1, 'xxx'), (2, 'yyy'), (3, 'xxx'), (4, 'yyy')]` – Ashwini Chaudhary Apr 04 '13 at 19:59

score 0 · Answer 6 · answered Apr 04 '13 at 19:59

0

If your list is in order, you can be sure that identical formats will be adjacent to each other. This means you don't need to use a set to keep track of past value. Just use a single variable to record the last format processed:

last = None
for (value, fmt) in formats:
    if fmt != last:
        last = fmt
        process(value, fmt)

answered Apr 04 '13 at 19:59

Blckknght

100,903
11
120
169

In case if the list is not sorted then this will take `O(NlogN)` time. – Ashwini Chaudhary Apr 04 '13 at 20:05
@AshwiniChaudhary No, if the list is not sorted it will still take `O(N)` time, but it will have processed some of the duplicates incorrectly. I'm relying on the questioner's assertion that the input is indeed sorted (by format), and presenting the simplest possible code that will work for that situation. The logic probably boils down to the same thing as @isbadawi's answer that uses `groupby`, but since mine doesn't need lambdas it will probably run slightly faster. – Blckknght Apr 04 '13 at 20:15
The list must be sorted as per the `fmt` element, otherwise this won't work. http://ideone.com/sX4RpB – Ashwini Chaudhary Apr 04 '13 at 20:22

How to process each item in an ordered list with duplicates only once in Python?

6 Answers6