47

I have a Python list in which elements can repeat.

>>> a = [1,2,2,3,3,4,5,6]

I want to get the first n unique elements from the list. So, in this case, if I want the first 5 unique elements, they would be:

[1,2,3,4,5]

I have come up with a solution using generators:

def iterate(itr, upper=5):
    count = 0
    for index, element in enumerate(itr):
        if index == 0:
            count += 1
            yield element
        elif element not in itr[:index] and count < upper:
            count += 1
            yield element

In use:

>>> i = iterate(a, 5)
>>> [e for e in i]
[1,2,3,4,5]

I have doubts about this being the most optimal solution. Is there an alternative strategy that I can use to write this in a more Pythonic and efficient way?

xssChauhan

13 Answers

52

I would use a set to remember what was seen and return from the generator when you have seen enough:

a = [1, 2, 2, 3, 3, 4, 5, 6]
    
def get_unique_N(iterable, N):
    """Yields (in order) the first N unique elements of iterable. 
    Might yield less if data too short."""
    seen = set()
    for e in iterable:
        if e in seen:
            continue
        seen.add(e)
        yield e
        if len(seen) == N:
            return
            
k = get_unique_N([1, 2, 2, 3, 3, 4, 5, 6], 4)
print(list(k))
    

Output:

[1, 2, 3, 4]

According to PEP 479 you should return from generators rather than raise StopIteration - thanks to @khelwood & @iBug for pointing that out - one never stops learning.

With 3.6 you get a DeprecationWarning, with 3.7 you get a RuntimeError: see the Transition Plan in PEP 479 if you are still using raise StopIteration.
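
For illustration, here is a minimal sketch of the two styles (the function names are made up for this example); the first one triggers the behaviour described above, the second one is the form PEP 479 requires:

def stop_with_raise(iterable, n):
    # pre-PEP 479 style: raising StopIteration inside a generator
    # is turned into a RuntimeError on Python 3.7+
    seen = set()
    for e in iterable:
        if e not in seen:
            seen.add(e)
            yield e
        if len(seen) == n:
            raise StopIteration  # don't do this

def stop_with_return(iterable, n):
    # PEP 479 style: a plain return ends the generator cleanly
    seen = set()
    for e in iterable:
        if e not in seen:
            seen.add(e)
            yield e
        if len(seen) == n:
            return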


Your solution using elif element not in itr[:index] and count<upper: uses O(k) lookups - with k being the length of the slice - using a set reduces this to O(1) lookups, but uses more memory because the set has to be kept as well. It is a speed vs. memory tradeoff - which is better depends on the application and the data.

Consider [1, 2, 3, 4, 4, 4, 4, 5] vs [1] * 1000 + [2] * 1000 + [3] * 1000 + [4] * 1000 + [5] * 1000 + [6]:

For 6 uniques (in longer list):

  • yours would have lookups of O(1)+O(2)+...+O(5001)
  • mine would have 5001*O(1) lookups + the memory for the set {1, 2, 3, 4, 5, 6} (a rough timing sketch follows below)
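
If you want to measure the gap yourself, here is a rough timing sketch using the standard timeit module; slice_lookup is a condensed version of the question's approach (the index == 0 special case is redundant because itr[:0] is empty) and set_lookup is the set-based generator from this answer. The exact numbers depend on your machine, but on input shaped like the longer list above the gap should be large:

import timeit

setup = """
data = [1] * 1000 + [2] * 1000 + [3] * 1000 + [4] * 1000 + [5] * 1000 + [6]

def slice_lookup(itr, upper=6):
    # O(k) membership test against the already-seen prefix of the list
    count = 0
    for index, element in enumerate(itr):
        if element not in itr[:index] and count < upper:
            count += 1
            yield element

def set_lookup(itr, upper=6):
    # O(1) membership test against a set of seen elements
    seen = set()
    for element in itr:
        if element in seen:
            continue
        seen.add(element)
        yield element
        if len(seen) == upper:
            return
"""

print(timeit.timeit("list(slice_lookup(data))", setup=setup, number=5))
print(timeit.timeit("list(set_lookup(data))", setup=setup, number=5))
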
Patrick Artner
25

You can adapt the popular itertools unique_everseen recipe:

def unique_everseen_limit(iterable, limit=5):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        if element not in seen:
            seen_add(element)
            yield element
        if len(seen) == limit:
            break

a = [1,2,2,3,3,4,5,6]

res = list(unique_everseen_limit(a))  # [1, 2, 3, 4, 5]

Alternatively, as suggested by @Chris_Rands, you can use itertools.islice to extract a fixed number of values from a non-limited generator:

from itertools import islice

def unique_everseen(iterable):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        if element not in seen:
            seen_add(element)
            yield element

res = list(islice(unique_everseen(a), 5))  # [1, 2, 3, 4, 5]

Note the unique_everseen recipe is available in 3rd party libraries via more_itertools.unique_everseen or toolz.unique, so you could use:

from itertools import islice
from more_itertools import unique_everseen
from toolz import unique

res = list(islice(unique_everseen(a), 5))  # [1, 2, 3, 4, 5]
res = list(islice(unique(a), 5))           # [1, 2, 3, 4, 5]
jpp
  • 1
    The alternative would be creating an infinite generator and then `itertools.islice(gen, limit)` – Chris_Rands Dec 21 '18 at 16:33
  • Why not drop line 3 in your first block of code and do `seen.add(element)` instead? – gosuto Dec 27 '18 at 11:04
  • @jorijnsmit, It's an optimisation. One less lookup in each iteration of the for loop. You should notice the difference in very large loops. – jpp Dec 27 '18 at 11:28
  • This 2nd solution is the fastest one as can be seen [here](https://stackoverflow.com/a/67391971/3243159). – Jurgen Strydom May 04 '21 at 20:55
9

If your objects are hashable (ints are hashable), you can write a utility function using the fromkeys method of the collections.OrderedDict class (or, starting from Python 3.7, a plain dict, since they became officially ordered), like

from collections import OrderedDict


def nub(iterable):
    """Returns unique elements preserving order."""
    return OrderedDict.fromkeys(iterable).keys()

and then the implementation of iterate can be simplified to

from itertools import islice


def iterate(itr, upper=5):
    return islice(nub(itr), upper)

or, if you always want a list as output:

def iterate(itr, upper=5):
    return list(nub(itr))[:upper]

Improvements

As @Chris_Rands mentioned, this solution walks through the entire collection; we can improve on this by writing the nub utility as a generator, like others already did:

def nub(iterable):
    seen = set()
    add_seen = seen.add
    for element in iterable:
        if element in seen:
            continue
        yield element
        add_seen(element)
Azat Ibrakov
7

Here is a Pythonic approach using itertools.takewhile():

In [95]: from itertools import takewhile

In [96]: seen = set()

In [97]: set(takewhile(lambda x: seen.add(x) or len(seen) <= 4, a))
Out[97]: {1, 2, 3, 4}
Mazdak
  • 6
    By which definition is this abuse of the `or` operator considered *Pythonic*? – cdlane Dec 21 '18 at 16:42
  • 2
    @cdlane By the definition in which this use of `or` is misuse. – Mazdak Dec 21 '18 at 16:47
  • 1
    I think a proper function should be used instead of a lambda. Here the `seen.add` is not returning a boolean value, and still being used for truth checking. Your implementation saves us writing a generator function, which is welcome suggestion. But the `predicate` function should be more explicit. – xssChauhan Dec 21 '18 at 16:50
  • 3
    We've different concepts of *Pythonic*: [To be Pythonic is to use the Python constructs and data structures with clean, readable idioms.](https://blog.startifact.com/posts/older/what-is-pythonic.html) – cdlane Dec 21 '18 at 17:08
  • 3
    I disagree this is Pythonic; `seen.add(x) or len(seen) <= 4` shouldn't be used in a function like `takewhile`, for the same reasons you wouldn't use it in `map` or `filter` – juanpa.arrivillaga Dec 21 '18 at 17:38
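
For readers who share the concern in those comments, here is a hedged sketch of what a named predicate might look like in place of the lambda (the function name keep_collecting is made up); it behaves the same way but makes the side effect on seen an explicit statement:

from itertools import takewhile

a = [1, 2, 2, 3, 3, 4, 5, 6]
seen = set()

def keep_collecting(x, limit=4):
    # record x as seen (this is the side effect the lambda relied on),
    # then keep taking elements while at most `limit` distinct values were seen
    seen.add(x)
    return len(seen) <= limit

print(set(takewhile(keep_collecting, a)))  # {1, 2, 3, 4}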
6

You can use OrderedDict or, since Python 3.7, an ordinary dict, since they are implemented to preserve the insertion order. Note that this won't work with sets.

N = 3
a = [1, 2, 2, 3, 3, 3, 4]
d = {x: True for x in a}
list(d.keys())[:N]
Jindra Helcl
5

There are really amazing answers to this question - fast, compact and brilliant! The reason I am putting this code here is that I believe there are plenty of cases where you don't care about losing a microsecond, nor do you want an additional library in your code, just to solve a simple task once.

a = [1,2,2,3,3,4,5,6]
res = []
for x in a:
    if x not in res:  # yes, not optimal, but doesn't need an additional dict
        res.append(x)
        if len(res) == 5:
            break
print(res)
grapes
  • me likey. straight forward, verbose, and with a few less lines. – pangyuteng Dec 21 '18 at 16:20
  • 2
    Use `set` rather than `list` for O(1) lookup. – jpp Dec 21 '18 at 16:20
  • 3
    @teng ... inefficient. – juanpa.arrivillaga Dec 21 '18 at 16:20
  • @juanpa.arrivillaga Yes, this is inefficient, I agree. Lists are not well suited for search operations (like `in` does), better use dictionaries. In fact I just participated for fun. But, to mention, there are certain cases when simple and understandable code matters, like educational purposes – grapes Dec 21 '18 at 16:23
  • @juanpa.arrivillaga inefficient compared to... OP's implementation? – pangyuteng Dec 21 '18 at 16:23
  • 1
    @teng similarly inefficient. – juanpa.arrivillaga Dec 21 '18 at 16:23
  • @juanpa.arrivillaga gotcha! ;) amazing this post got so much attention :p – pangyuteng Dec 21 '18 at 16:24
  • @Chris_Rands, Justification is good - you would need an additional dictionary for that, and that increases the number of lines. That's what I meant, of course, not the operator itself – grapes Dec 21 '18 at 16:27
  • 1
    @grapes but this is time-inefficient. Also, who cares about the line numbers? Do you suffer from a lack of lines? Didn't see your response to me. Yes, I agree, this implementation would work and is at least correct. I didn't downvote, btw. – juanpa.arrivillaga Dec 21 '18 at 16:30
  • @juanpa.arrivillaga, I know, i know, the best answer is using `set()`, as mentioned by other participants. I am not pretending this code to be the fastest, just compact and understandable. This simple question produced so much rush that I couldn't withstand a desire to participate – grapes Dec 21 '18 at 16:39
  • @juanpa.arrivillaga Yah just to be clear, I down voted for the reasons given above. Plus, this is not just a micro-optimisation, it will make a large difference for medium sized data sets (and there is no reason to avoid imports from the standard library) – Chris_Rands Dec 21 '18 at 18:08
5

Assuming the elements are ordered as shown, this is an opportunity to have fun with the groupby function in itertools:

from itertools import groupby, islice

def first_unique(data, upper):
    return islice((key for (key, _) in groupby(data)), 0, upper)

a = [1, 2, 2, 3, 3, 4, 5, 6]

print(list(first_unique(a, 5)))

Updated to use islice instead of enumerate per @juanpa.arrivillaga. You don't even need a set to keep track of duplicates.

cdlane
  • You might as well use `islice` – juanpa.arrivillaga Dec 21 '18 at 17:39
  • So `groupby` retains order, nice, but is it an implementation detail or a feature? – kubanczyk Dec 22 '18 at 09:39
  • 1
    @kubanczyk, yes `groupby` is mostly used with sorted data, where it becomes an aggregator. If the OP's data weren't sorted, `groupby` wouldn't work for this problem. However, `groupby` can be used with unsorted data to solve some other problems. In that case it can be used to detect when data changes. – cdlane Dec 22 '18 at 22:15
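
To illustrate the point in that last comment, here is a small sketch with made-up, unsorted data: groupby only collapses adjacent duplicates, so its keys mark the positions where the value changes rather than the distinct values.

from itertools import groupby

unsorted_data = [1, 2, 2, 1, 3, 3, 2]  # deliberately unsorted example

# groupby starts a new group every time the value changes, so values
# that repeat non-adjacently show up more than once among the keys
print([key for key, _ in groupby(unsorted_data)])  # [1, 2, 1, 3, 2]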
4

Using set with sorted + key

sorted(set(a), key=list(a).index)[:5]
Out[136]: [1, 2, 3, 4, 5]
BENY
4

Given

import itertools as it


a = [1, 2, 2, 3, 3, 4, 5, 6]

Code

A simple list comprehension (similar to @cdlane's answer).

[k for k, _ in it.groupby(a)][:5]
# [1, 2, 3, 4, 5]

Alternatively, in Python 3.6+:

list(dict.fromkeys(a))[:5]
# [1, 2, 3, 4, 5]
pylang
2

Profiling Analysis

Solutions

Which solution is the fastest? There are two clear favorite answers (and 3 solutions) that captured most of the votes.

  1. The solution by Patrick Artner - denoted as PA.
  2. The first solution by jpp - denoted as jpp1
  3. The second solution by jpp - denoted as jpp2

This is because these claim to run in O(N) while others here run in O(N^2), or do not guarantee the order of the returned list.

Experiment setup

For this experiment 3 variables were considered.

  1. N elements. The number of first N elements the function is searching for.
  2. List length. The longer the list, the further the algorithm has to look to find the last element.
  3. Repeat limit. How many times an element can repeat before the next element occurs in the list. This is uniformly distributed between 1 and the repeat limit.

The assumptions for data generation were as follows. How strict these are depends on the algorithm used, but they are more a note on how the data was generated than a limitation on the algorithms themselves.

  1. An element never occurs again after its repeated sequence first appears in the list.
  2. The elements are numeric and increasing.
  3. The elements are of type int.

So in a list of [1,1,1,2,2,3,4 ....] 1,2,3 would never appear again. The next element after 4 would be 5, but there could be a random number of 4s up to the repeat limit before we see 5.

A new dataset was created for each combination of variables and re-generated 20 times. The Python timeit function was used to profile the algorithms 50 times on each dataset. The mean time of the 20x50=1000 runs (for each combination) was reported here. Since the algorithms are generators, their outputs were converted to a list to get the execution time.
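
The answer's actual benchmark code is not reproduced here, but a minimal sketch of how such a dataset and timing run might look, assuming the rules described above (the helper names make_data and time_solution are made up):

import random
import timeit

def make_data(n_unique, repeat_limit):
    # increasing ints; each value is repeated a uniform-random number
    # of times between 1 and repeat_limit and never appears again
    data = []
    for value in range(n_unique):
        data.extend([value] * random.randint(1, repeat_limit))
    return data

def time_solution(solution, data, n, number=50):
    # generators are consumed into a list so the full run is measured
    return timeit.timeit(lambda: list(solution(data, n)), number=number)

# e.g. with get_unique_N from the accepted answer:
# data = make_data(n_unique=200, repeat_limit=100)
# print(time_solution(get_unique_N, data, 50))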

Results

As expected, the more elements are searched for, the longer it takes. This graph shows that the execution time is indeed O(N), as claimed by the authors (the straight line confirms this).

Fig 1. Varying the first N elements searched for.

None of the three solutions consumes additional computation time beyond what is required. The image below shows what happens when the list is limited in size, rather than by N elements. Lists of length 10k, with elements repeating at most 100 times (and thus repeating 50 times on average), would on average run out of unique elements by 200 (10000/50). If any of these graphs showed an increase in computation time beyond 200, that would be a cause for concern.

Fig 2. The effect of first N elements chosen > number of unique elements.

The figure below again shows that processing time increases (at a rate of O(N)) the more data the algorithm has to sift through. The rate of increase is the same as when first N elements were varied. This is because stepping through the list is the common execution block in both, and the execution block that ultimately decides how fast the algorithm is.

Fig 3. Varying the repeat limit.

Conclusion

The 2nd solution posted by jpp is the fastest of the 3 in all cases. It is only slightly faster than the solution posted by Patrick Artner, and is almost twice as fast as jpp's first solution.

Jurgen Strydom
  • This is very useful information. Would it also be possible to add a memory consumption analysis? That way a user could also make a decision considering both their constraints. – xssChauhan May 05 '21 at 08:43
  • I agree, however in this case the information stored in all 3 of the functions are very similar. Furthermore, the dataset processed will be much larger than the information stored, so the memory used by the function is negligible in comparison. – Jurgen Strydom May 10 '21 at 07:12
1

Why not use something like this?

>>> a = [1, 2, 2, 3, 3, 4, 5, 6]
>>> list(set(a))[:5]
[1, 2, 3, 4, 5]
Aleksandr
0

Example list:

a = [1, 2, 2, 3, 3, 4, 5, 6]

The function returns all unique items, or the requested number of unique items, from the list.

1st argument - the list to work with; 2nd argument (optional) - the number of unique items to return (by default None, which means all unique elements will be returned)

def unique_elements(lst, number_of_elements=None):
    return list(dict.fromkeys(lst))[:number_of_elements]

Here is an example of how it works. The list name is "a", and we need to get 2 unique elements:

print(unique_elements(a, 2))

Output:

[1, 2]

Quanti Monati
0
a = [1, 2, 2, 3, 3, 4, 5, 6]

from collections import defaultdict

def function(lis, n):
    dic = defaultdict(int)
    sol = set()
    for i in lis:
        if not dic[i]:  # element not seen yet
            sol.add(i)
            dic[i] = 1
            if len(sol) >= n:
                break
    return list(sol)

print(function(a, 3))

Output:

[1, 2, 3]
sahasrara62