3

I have an ordered list of things to process that includes some duplicates and I only want to process the first occurrence. Presently, I'm doing it like this in Python v2.7:

seen = set()
for (value, fmt) in formats:
  if fmt not in seen:
    seen.add(fmt)
    process(value, fmt)

Is there anyway to simultaneously insert a new element into seen and detect whether or not it was already present? (This would avoid the repeated lookup of fmt in the set.)

seen = set()
for (value, fmt) in formats:
  # myInsert() would return true if item was not already present.
  if seen.myInsert(fmt):
    process(value, fmt)

Or, alternatively, can I somehow just filter my formats to exclude duplicate entries before looping?

unique_formats = removeDuplicates(formats, key=itemgetter(1))
for (value, fmt) in unique_formats:
  process(value, fmt)
WilliamKF
  • 41,123
  • 68
  • 193
  • 295
  • 2
    Is the set lookup followed by `add()` really the overall performance bottleneck worth optimizing? – NPE Apr 04 '13 at 19:46
  • Not in this case, but I still want to know how to do this for the rare case where it does matter. Plus, I'd like the code to be more succinct with only one function call instead of two. – WilliamKF Apr 04 '13 at 19:52
  • I'd go for the first solution you provided but make the add function local. -> seen = set(), add = seen.add and call add(fmt). – SaCry Apr 04 '13 at 20:03
  • Please post a small part of your list in question body. – Ashwini Chaudhary Apr 04 '13 at 20:23
  • @SaCry I don't quite understand what you are suggesting, could you please elaborate in an answer? – WilliamKF Apr 04 '13 at 20:31
  • @AshwiniChaudhary I can't easily post a portion of the list due to the question being a simplified version and thus my data does not exactly match the question. Further, I'd prefer answers that are independent of my data. – WilliamKF Apr 04 '13 at 20:31
  • @WilliamKF But how's your list sorted, is it sorted based on the first element of the tuple or the second? – Ashwini Chaudhary Apr 04 '13 at 20:38
  • @AshwiniChaudhary It is sorted based on the first element of the tuple, the `value`. – WilliamKF Apr 04 '13 at 20:48
  • @WilliamKF Yeah sure. It just means that lookups in python are expensive. Seeing that "add" will be used quite often you could simply use a local variable that holds append. This saves you as mentioned a lookup per loop. More on that -> http://www.selenic.com/blog/?p=656 – SaCry Apr 04 '13 at 20:54

6 Answers6

2

You could take the length of the set before and after the add(). If it didn't change, the format was already in the set.

seen = set()
for (value, fmt) in formats:
    l1 = len(seen)
    seen.add(fmt)
    if l1 != len(seen):
         process(value, fmt)

Your question presumes that the in test is a costly operation. This turns out not to be the case. Using len() can take more time, although both are quite fast;

In [4]: seen = set(range(10000))

In [5]: %timeit 5995 in seen
10000000 loops, best of 3: 122 ns per loop

In [6]: %timeit len(seen)
1000000 loops, best of 3: 167 ns per loop

(measured with CPython 2.7.3 on a 2.5 GHz Core Quad Q9300)

Roland Smith
  • 42,427
  • 3
  • 64
  • 94
  • Well both `add()` and `elem in seen` are `O(1)` operation, so this doesn't improve anything. Except the fact that you used some extra memory here. – Ashwini Chaudhary Apr 04 '13 at 19:45
1

I think your first approach is the best. Even the unique_everseen recipe from itertools recipes is using the same approach.

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in ifilterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
  • IMHO this is the right approach. It's an iteration logic, so you wrap it up as a generator, write `for something in looper(whatever):`, and get on with the day. – DSM Apr 04 '13 at 19:55
0

You have to use a set:

for (value, fmt) in set(formats):
   process(value, fmt)
Ionut Hulub
  • 6,180
  • 5
  • 26
  • 55
0
from ordereddict import OrderedDict
unique_formats = list(OrderedDict.fromkeys(format))
process(unique_formats)

This will preserve the order and remove duplicates

cashmere
  • 2,811
  • 1
  • 23
  • 32
0

You can use itertools.groupby to group by the second element of the pair, and then only consider the first value.

>>> from itertools import imap, groupby
>>> from operator import itemgetter
>>> formats = [(1, 'xxx'), (2, 'xxx'), (3, 'yyy'), (4, 'yyy')]
>>> for fmt, value in imap(lambda (x, y): (x, next(y)[0]), groupby(formats, itemgetter(1)):
...    print('%s: %s', fmt, value)
...
xxx: 1
yyy: 3
Ismail Badawi
  • 36,054
  • 7
  • 85
  • 97
0

If your list is in order, you can be sure that identical formats will be adjacent to each other. This means you don't need to use a set to keep track of past value. Just use a single variable to record the last format processed:

last = None
for (value, fmt) in formats:
    if fmt != last:
        last = fmt
        process(value, fmt)
Blckknght
  • 100,903
  • 11
  • 120
  • 169
  • In case if the list is not sorted then this will take `O(NlogN)` time. – Ashwini Chaudhary Apr 04 '13 at 20:05
  • @AshwiniChaudhary No, if the list is not sorted it will still take `O(N)` time, but it will have processed some of the duplicates incorrectly. I'm relying on the questioner's assertion that the input is indeed sorted (by format), and presenting the simplest possible code that will work for that situation. The logic probably boils down to the same thing as @isbadawi's answer that uses `groupby`, but since mine doesn't need lambdas it will probably run slightly faster. – Blckknght Apr 04 '13 at 20:15
  • The list must be sorted as per the `fmt` element, otherwise this won't work. http://ideone.com/sX4RpB – Ashwini Chaudhary Apr 04 '13 at 20:22