60

If I have a list, say l = [1, 8, 8, 8, 1, 3, 3, 8], and it's guaranteed that every element occurs an even number of times, how do I make a list in which every element of l now occurs half as many times? So since 1 occurred twice, it should now occur once. Since 8 occurred 4 times, it should now occur twice. Since 3 occurred twice, it should occur once.

So the new list will be something like k=[1,8,8,3]

What is the fastest way to do this? I used list.count() for every element, but it was very slow.
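For reference, a sketch of the slow approach described above (reconstructed, since the exact code isn't shown). Calling count() inside the loop rescans the whole list on every iteration, which makes it quadratic:

l = [1, 8, 8, 8, 1, 3, 3, 8]

k = []
for x in l:
    # each count() call walks the entire list, so this is O(n**2) overall
    if k.count(x) < l.count(x) // 2:
        k.append(x)

print(k)  # [1, 8, 8, 3]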

NePtUnE
  • did you try sorting and then just taking the odd position elements? – Gokul Jul 08 '20 at 11:19
  • You should specify whether order of any kind is important. – chrylis -cautiouslyoptimistic- Jul 09 '20 at 06:08
  • If order is important it must also be specified which items to retain: the first ones, the last ones, the evens, the odds, or something else? No word about it, so I assume the order is unimportant. – Florian F Jul 09 '20 at 11:49
  • That sounds a bit like homework assignment ... – JensG Jul 09 '20 at 12:23
  • Technically, hashing (`O(n)`) is faster than sorting (`O(n log n)`) beyond some point. – RBarryYoung Jul 09 '20 at 17:58
  • @FlorianF Preserving the order doesn't necessarily mean there is a distinction between duplicates, as long as you don't change the relative order of non equal values. – chepner Jul 09 '20 at 18:44
  • What an odd question! Does it come from real life or is it purely academic? – Daemon Painter Jul 09 '20 at 18:54
  • @chepner Suppose you have [1,1,2,3,3,2,1,1] and you remove half of the duplicates, which one preserves the relative order: [1,2,3,1], [1,3,2,1], [1,1,2,3] or [3,2,1,1]? It depends whether you keep the 1st in each pair, the 2nd in each pair, the first half of identical numbers or the last half of identical numbers. – Florian F Jul 10 '20 at 11:36
  • @DaemonPainter pure academics is real life – thymaro Jul 28 '20 at 23:55

12 Answers

105

If order isn't important, one way is to sort the list and then take only the elements at odd (or even) indexes. Those two slices will be identical, so you only need one of them.

l = [1,8,8,8,1,3,3,8]
l.sort()

# Get all odd indexes
odd = l[1::2]

# Get all even indexes
even = l[::2]

print(odd)
print(odd == even)

Result:

[1, 3, 8, 8]
True
Wimanicesir
  • what does the list[1::2] part do? – NePtUnE Jul 08 '20 at 11:31
  • @NePt [understand slice notation](https://stackoverflow.com/questions/509211/understanding-slice-notation) – Patrick Artner Jul 08 '20 at 11:33
  • Also this post could help a lot: https://stackoverflow.com/questions/12433695/extract-elements-of-list-at-odd-positions – Wimanicesir Jul 08 '20 at 11:34
  • To be honest, I did not benchmark it. But I suppose a sort is faster than a count on a list. – Wimanicesir Jul 08 '20 at 11:46
  • @Wimanicesir I doubt that. Sorting is `O(n log n)`, counting should just be `O(n)`, albeit with a (probably) higher constant factor. – ApproachingDarknessFish Jul 08 '20 at 21:57
  • @Wimanicesir See test in my answer. Yours wins timing test when order is not important. Mine wins when order is important. Congrats. Clever approach. – jpf Jul 09 '20 at 00:41
  • @Wimanicesir why not starting from the first element with `list[::2]`? Also since this answer is getting attention you might want to rename the variable so that it does not shadow the python built-in. – Ma0 Jul 09 '20 at 11:58
  • @Ma0, as said in the text above, even or uneven will both work. Because it's the same list. I made an edit for the variables and added even as well. – Wimanicesir Jul 09 '20 at 12:09
  • @Wimanicesir exactly because it does not matter, going the extra mile to put a `1` in there is/can be a bit confusing. Other than that +1 – Ma0 Jul 09 '20 at 12:23
  • @ApproachingDarknessFish I'm not sure counting is O(n), since generally for sorting the order of comparisons may be O(n) to O(n log n) but for counting the order of comparisons can be up to O(1 + 2 + 3 ... + N) = O(N**2) (a lot is going on in `if i in count`). – jpf Jul 11 '20 at 11:59
  • @jpf I don't understand the context for that snippet. If `count` is a dictionary, then `i in count` is `O(1)`. – ApproachingDarknessFish Jul 13 '20 at 20:45
  • @ApproachingDarknessFish This pretty much explains my understanding of the `i in hash` for a dictionary (hash): https://stackoverflow.com/questions/48603244/why-is-a-hash-table-lookup-only-o1-time-when-searching-for-a-key-is-on. I've been told it's possible to have O(1) for `i in hash` but I haven't found a general explanation (only special cases). I do understand that `hash[i]` is O(1) because `i` can be mapped to a value with a hash function. But, again, determining whether any arbitrary key (existing or not) is in a hash appears to be O(N). This becomes O(N(N-1)/2) in current context. – jpf Jul 14 '20 at 00:32
28

Use a Counter to keep track of the count of each element:

from collections import Counter
l = [1,8,8,8,1,3,3,8]
res = []
count = Counter(l) # it's like {1: 2, 8: 4, 3: 2}
for key, val in count.items():
    res.extend(val//2 * [key])
print(res)
# output
[1, 8, 8, 3]
  • If you make the list `[1,8,3,8,8,1,3,8]`, the output for this will be `[1, 8, 8, 3]` instead of `[1, 8, 3, 8]`. Or at least that's what I assume the output should be based on the example in the question. – Bernhard Barker Jul 08 '20 at 20:39
  • @BernhardBarker the order will be determined by the first time a number is encountered in the list, because in the latest versions of Python dicts are ordered by insertion order. Unless `Counter` doesn't use an actual dict. – Mark Ransom Jul 08 '20 at 21:45
  • @MarkRansom So it will be [1,8,8,3] like Bernhard said – user253751 Jul 09 '20 at 14:56
  • @user253751 the same comment was left on two answers, but I never saw a form of either answer with the other outcome so I had trouble knowing why that comment was left. Mine was really just a clarification of the mechanics for the curious. – Mark Ransom Jul 09 '20 at 15:15
  • @MarkRansom I suspect, based on the given example, that `[4,2,5,2]` (instead of the `[4,2,2,5]` this answer would give) is the correct output for `[4,2,5,2,2,4,5,2]`. This is based on taking the first half of elements of each value and keeping them in the same relative order. However, it could also be the case that it should be ordered by the position of the first element of each value, as this answer does, or that order isn't important at all. Only the asker can say for sure (based on the accepted answer, I guess order doesn't matter). Explaining what code in an answer does is useful though. – Bernhard Barker Jul 09 '20 at 19:02
21

Since you guarantee that each element of the list occurs an even number of times, it is faster to build the count as you build the output list, rather than building a counter (or sorting) first and using it afterwards.

l = [1,8,8,8,1,3,3,8]
count={}
res=[]
for i in l:
  if i in count: count[i]+=1
  else: count[i]=1
  if count[i]%2: res.append(i)

print(res)

Output

[1,8,8,3]

EDIT Comparing time/expense of each method

Using the timeit module shows that this approach is 2.7 times faster than using a counter first.

i.e.

def one():
  l = [1,8,8,8,1,3,3,8]
  count={}
  res=[]
  for i in l:
    if i in count: count[i]+=1
    else: count[i]=1
    if count[i]%2: res.append(i)

  #print(res)


def two():
  from collections import Counter
  l = [1,8,8,8,1,3,3,8]
  res = []
  count = Counter(l) # it's like {1: 2, 8: 4, 3: 2}
  for key, val in count.items():
    res.extend(val//2 * [key])

import timeit

o=timeit.Timer(one)

t=timeit.Timer(two)

print(o.timeit(100000))

print(t.timeit(100000))

print(o.timeit(100000))

print(t.timeit(100000))

Output (seconds)

0.28666
0.80822
0.28678
0.80113

If order isn't important, then Wimanicesir's method is preferable: it ran in 0.07037 seconds, roughly 4 times faster again (~11 times faster than the Counter approach).

UPDATE I suspected that the Counter method in two (unordered) may carry significant overhead or import cost, so I tested the "count first, compile result later" method while counting with the simple dict method from one (ordered):

count={}
for i in l:
  if i in count: count[i]+=1
  else: count[i]=1

which was much faster than Counter: replacing Counter in two with it brought the time from 0.80 down to 0.31. It is still slightly faster to compile the (ordered) result during counting, as in one, however. And for an unordered result, Wimanicesir's method is much faster still.
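For concreteness, the substituted function might look like this (a sketch; the original answer doesn't show it, and the name three is made up):

def three():
  l = [1,8,8,8,1,3,3,8]
  count={}
  for i in l:
    if i in count: count[i]+=1
    else: count[i]=1
  res=[]
  # compile the (ordered) result only after counting is complete
  for key, val in count.items():
    res.extend(val//2 * [key])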

jpf
  • Due to the sort operation, @Wimanicesir's method runs in O(n log(n)) time, but the two methods you propose here run in O(n) time, so they ought to be faster. – snibbets Jul 09 '20 at 11:56
  • If the question is a designed exercise rather than a problem that arose in the wild, it's probable IMO that the expected answer is basically this, but perhaps using a set rather than a dictionary. If the element is absent, emit it and add it. If the element is present, remove it and do nothing. This will use a bit less memory, and for some inputs would use a *lot* less memory, than the dictionary. But I wouldn't count on it to be faster, since the structural changes to the set might well be slower than just incrementing a dictionary entry. – Steve Jessop Jul 09 '20 at 17:47
  • It also motivates the guarantee, " every element occurs an even number of times": if there are no elements without a matching pair later on, then you don't have to worry about emitting things as soon as you see them. – Steve Jessop Jul 09 '20 at 17:47
  • @snibbets For sorting, the number of comparisons made is anywhere from O(n) to O(n log(n)) [https://en.wikipedia.org/wiki/Timsort] whereas for building a counter, the number of comparisons made is up to O(1 + 2 + ... + N) = O(N*(N+1)) [i.e. in statement `for i in l: if i in count` is actually `for i in l: for c in count: (iteratively update count based on comparison "if c==i"):...`]. – jpf Jul 11 '20 at 11:51
  • @jpf, the Counter class is basically an extension of dict, so a key in `count` can be found in O(1) time (on average) using a hash map. – snibbets Jul 11 '20 at 18:14
  • It's not a particularly meaningful time comparison to only look at what happens with a really small number of elements (8). When you have so few elements, the speed difference is going to be negligible in most cases and it's much less likely to accurately represent what will happen for more elements. It will start being a lot more important when you get to thousands, millions or billions of elements or more (and then different approaches will likely be more or less efficient based on whether you have few of many different values or many of few different values). – Bernhard Barker Jul 12 '20 at 00:22
  • @snibbets I'm talking about within the context of the loop, the function is O(N**2). – jpf Jul 12 '20 at 11:41
  • @snibbets We're comparing the number of comparisons that must occur to fill the hash map (`if c in count`). This is on the order of the number of elements existing in the hash (`count`). Since you do this for every element of the list, then the max becomes, generally, 1 + 2 + ... + N-1 = N*(N-1)/2, where N is the number of elements in the list. – jpf Jul 12 '20 at 12:23
  • @jpf, the line `if c in count` can be (basically) be performed in constant time. The Counter class code does not search through every key in count until it finds the right one. This is because it uses a hash function to find the key: https://en.wikipedia.org/wiki/Hash_table So the running time is not O(n**2), unless the hash function is very poor. I suppose it may be O(n\*\*2) in the worst case, but this would be unusual. – snibbets Jul 12 '20 at 13:37
  • @snibbets Constant time is for when a key already exists in a table and the hash function therefore directly translates it to a result. But, to determine if any key exists in the table with `if c in count` is longer, with a max of O(N). Since we do this in a loop, then generally the time for searching would be max of N*(N-1)/2 (https://stackoverflow.com/questions/48603244/why-is-a-hash-table-lookup-only-o1-time-when-searching-for-a-key-is-on). [Of course here we've abandoned talking about the particular problem that guarantees at least 2 of each entry] – jpf Jul 12 '20 at 17:12
  • No jpf, to test if a key exists, a hash can be taken of the key and this is used to find the key in the table. Even if the key does not exist, the code can still quickly determine that it is not in the hash table. I suggest running the experiment yourself. I generated 10 million items in an array (range of values from 0 to 10\*\*6, duplicated and shuffled the array and then tried both methods. With a Counter it took 11.0 s, sorting the array first according to @Wimanicesir's method took 15.8 s. – snibbets Jul 12 '20 at 23:17
  • @snibbets I tried all three approaches with a randomly arranged array of 100000 elements (using more was too slow). Number of tests is 1000 per each. Not using Counter but filling a dictionary myself took 59 seconds, using Counter took 69 seconds, and using `.sort` took 34 seconds. I did a test where **only** Counter was used (without filling array based on values in `count.items()`) and it took 31s. This means you might be right about hash key lookups, but the timing of the actual problem includes modulus of 2 or shifting bits by 1 (divide by 2), and this is taking way longer than I'd expect. – jpf Jul 13 '20 at 11:51
  • @snibbets Even just looking at `count.items()` without doing anything with it takes another 12 seconds (total 43s using Counter without using modulus or divide by 2). – jpf Jul 13 '20 at 11:54
  • @snibbets Actually I think you claim it should take less time significantly, and even though I've handicapped sort by randomly sorting values, it takes about the same time. I'm really not seeing any evidence that what you're saying about checking if a key exists in a hash is true. – jpf Jul 13 '20 at 12:00
  • @snibbets However, for the purpose of answering the current post, this conversation digresses. The fact is, for the purpose of answering the original question, both counting methods take significantly longer than the sort, even when array is much longer. Sort wins this case unless you can come up with a different, faster method. – jpf Jul 13 '20 at 12:09
  • @jpf, as the number of elements grows I would expect the Counter method to be the faster option. I suggest trying both approaches with 20 million elements or more. As my results showed above, the sorting method was 44% slower with this many elements, and I would expect the difference to increase as the number of elements grows. – snibbets Jul 14 '20 at 15:28
  • @snibbets It doesn't seem you are testing the same thing I am testing. As I wrote, I tried each function with 100000 randomly arranged elements (generated outside function) and tested each function 1000 times apiece which resulted in ~1min run times. If I used 20 million elements it would take 200 minutes each test. This means you are probably testing each function once (20000000/100000*1min/(1000tests) = 0.2min). Any rate, somewhere we don't seem to be testing the same idea. Everywhere it appears checking if key exists in hash is max O(N), and doesn't seem to be a logical way around it. – jpf Jul 14 '20 at 22:30
  • @snibbets It's not really useful doing this so abstractly anyway. I know you have your tests and I have mine, but without scrutinizing each to make sure we're comparing apples to apples and no one's "cheating" or unknowingly introducing bias, it's impossible to come to a conclusion in these comments. Will test what I think you're testing at some point to satisfy my curiosity, but so far I suspect if I make sure I'm making fair comparisons, I'll find `if i in hash` is O(N). – jpf Jul 14 '20 at 22:39
20

This is a classic use case for sets, and I'm quite surprised that no one else tried it out to see how it stacks up against the Counter and dict implementations.

I implemented a solution using set instead as follows:

def set_impl(l):
  bag = set()
  res = []
  for i in l:
    if i in bag:
      res.append(i)
      bag.remove(i)
    else:
      bag.add(i)
  return res

This implementation is about 28% faster than using Counter and 51% faster than using a dictionary.

The sort and slice implementation given by Wimanicesir is the fastest, giving results 17 times faster than when using set. Note though, that because it sorts the items before removing duplicates, the order of appearance is not preserved unlike the other three.

Here are all the suggested implementations with timing for evaluation of comparative performance.
https://repl.it/@franzalex/StackOverflow-py#removeDuplicateHalf.py

import random
import statistics as stats
from collections import Counter as counter
from timeit import Timer

def slice_impl(l):
  l.sort()
  res = l[::2]

def dict_impl(l):
  count={}
  res=[]
  for i in l:
    if i in count:
      count[i] += 1
    else:
      count[i] = 1
    if count[i] % 2:
      res.append(i)

def counter_impl(l):
  count = counter(l)
  res = []
  for key, val in count.items():
    res.extend(val//2 * [key])

def set_impl(l):
  bag = set()
  res = []
  for i in l:
    if i in bag:
      res.append(i)
      bag.remove(i)
    else:
      bag.add(i)

def timed_run():
  for name, func in {"Sort and Slice": slice_impl, 
                     "Dictionary": dict_impl, 
                     "Counter": counter_impl, 
                     "Set": set_impl}.items():
    seq = list(range(50))*2
    results = []
    print(f"{name} Implementation Results")
    for i in range(50):
      if len(results) % 10: random.shuffle(seq) # reshuffle on all but every 10th run
      results.append(Timer(lambda: func(seq)).timeit(10**4))
      # print(f"Run {i+1:02}: {results[i]:.6f}")
    print("")
    print(f"Median:  {stats.median(results):.6f}")
    print(f"Mean:    {stats.mean(results):.6f}")
    print(f"Std Dev: {stats.stdev(results):.6f}")
    print("\n\n")

timed_run()

Sample run result

Sort and Slice Implementation Results

Median:  0.009686
Mean:    0.009721
Std Dev: 0.000529


Dictionary Implementation Results

Median:  0.230081
Mean:    0.227631
Std Dev: 0.014584


Counter Implementation Results

Median:  0.192730
Mean:    0.194577
Std Dev: 0.008015


Set Implementation Results

Median:  0.149604
Mean:    0.151227
Std Dev: 0.006838
Alex Essilfie
  • There are a few problems with this benchmark. For one, you only shuffle the list every ten iterations, so the sort algorithm gets an easy run 9 out of 10 times (TimSort is Ω(n) on already sorted lists). Moreover, your list has only 50 elements. The three algorithms based on hash maps/sets have a complexity of Ω(n), while the sort-based approach has a complexity of Ω(n log n). The difference will only matter for really big lists. It's unlikely that these problems change anything about your conclusion, though – I doubt it's practical to run this on a list big enough to make sort slower. – Sven Marnach Aug 03 '20 at 13:50
  • @SvenMarnach: Mine's the only one that's throwing in a degree of complexity to affect the performance. Granted, shuffling a list of 50 items after every 10th iteration may not be much but just as you said, the real difference in performance can only be seen in very large sets which may not be very practical to use as demonstration here. That said, without having looked at the implementation of each of the hash-based data structures, it looks like the set trumps the other two based on the limitations of the tests I ran. – Alex Essilfie Aug 05 '20 at 23:40
7

Instead of using a counter, which keeps track of an integer for each possible element of the list, try mapping elements to booleans using a dictionary. Map an element to true the first time it's seen (and keep that copy); every time after that, flip the bit, and keep the element only when the bit flips back to true.
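A minimal sketch of this idea (illustrative, not the answerer's code): the boolean tracks whether an odd number of copies of each value has been seen so far:

l = [1, 8, 8, 8, 1, 3, 3, 8]

odd_so_far = {}  # value -> True if an odd number of copies has been seen
res = []
for x in l:
    odd_so_far[x] = not odd_so_far.get(x, False)  # flip the bit (first sighting flips False -> True)
    if odd_so_far[x]:  # odd sighting: keep this copy
        res.append(x)

print(res)  # [1, 8, 8, 3]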

  • This is very memory efficient. Note that you have to revert back to false after each 2nd occurrence (or remove them from the map altogether; this depends on whether your elements are sparse or not.) – Artelius Jul 09 '20 at 04:06
  • This would be my approach. If the element value was reasonably limited (under 100,000 or so) just use a bit array. Otherwise use a hash table. Copy the elements one at a time from source list to output list, copying or not based on the bit array. Order N. – Hot Licks Jul 09 '20 at 18:13
  • I like this answer, but I feel it needs an accompanying example. – Shaun Cockerill Jul 22 '20 at 01:34
  • This is like recreating the functionality of sets using a dict or bitarray. In Python it is likely to be faster simply to use a set. – Stuart Jul 23 '20 at 22:21
3

If you are not concerned about preserving relative order, you can first get a count of each element using collections.Counter, then create a new list with each element duplicated half as many times.

>>> from collections import Counter
>>> from itertools import chain
>>> l = [1, 8, 8, 8, 1, 3, 3, 8]
>>> list(chain.from_iterable([key]*(count//2) for key, count in Counter(l).items()))
[1, 8, 8, 3]
Cory Kramer
3

You keep a list of all items that have been visited an odd number of times, then you iterate over all list items.

In other languages I would probably use some map() or filter() method, but here is some simple code, since I don't know Python well enough! :)

l = [1,8,8,8,1,3,3,8]
seen = []
result = []
for num in l:
  if num in seen:
    seen.remove(num)
    #result.append(num) #print every even appearance
  else:
    seen.append(num)
    result.append(num) #print every odd appearance

if len(seen)==0:
  print(result)
else:
  print("Error: uneven elements found:", seen)

At the end, the seen list should be empty, so you can use that as a sanity check before returning the result list.

Edit: here's a version with filter that returns the odd appearances:

l = [1,8,8,8,1,3,3,8]
seen = []
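# append()/remove() both return None; the lambda below is arranged so it
# evaluates truthy exactly on the odd (1st, 3rd, ...) appearance of x,
# which is what filter() keeps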
result = list(filter(lambda x: seen.append(x) is None if x not in seen else not seen.remove(x) is None, l))

if len(seen)==0:
  print(result)
else:
  print("Error: uneven elements found:", seen)

and this one returns the even appearances:

l = [1,8,8,8,1,3,3,8]
seen = []
result = list(filter(lambda x: seen.remove(x) is None if x in seen else not seen.append(x) is None, l))

if len(seen)==0:
  print(result)
else:
  print("Error: uneven elements found:", seen)
  • Hello, and welcome to SO! Not sure about the downvote, probably due to the answer being extremely over-engineered for the task at hand. Please, do take a good habit of benchmark and post timings when the user asks for "what is the fastest". – Daemon Painter Jul 09 '20 at 18:58
1

I like using a trie set, since you need to detect duplicates in order to remove them; a big hash set (lots of buckets) also works. The trie does not go unbalanced, and you do not need to know the size of the final set in advance. An alternative is a very parallel sort: brute force.
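A minimal sketch of the trie-set idea (illustrative, assuming non-negative integer elements keyed by their decimal digits; TrieSet, toggle, and the "$" end marker are invented for this sketch, not from the answer):

class TrieSet:
    def __init__(self):
        self.root = {}

    def toggle(self, n):
        # Insert n if absent, remove it if present; return True if it was present.
        node = self.root
        for d in str(n):                  # one trie level per decimal digit
            node = node.setdefault(d, {})
        present = node.pop("$", False)    # "$" marks the end of a stored number
        if not present:
            node["$"] = True              # (removal leaves empty nodes behind; fine for a sketch)
        return bool(present)

trie = TrieSet()
res = []
for x in [1, 8, 8, 8, 1, 3, 3, 8]:
    if trie.toggle(x):                    # the second copy of a pair completes it: keep one
        res.append(x)

print(res)  # [8, 1, 3, 8]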

0

I know this has been answered and there are some quite lengthy solutions, and the question specifically mentions Python. However, I thought a PowerShell solution might be interesting (and simple!) for some:

Version 1 (grouping items - less efficient)

$OriginalArray = @("1","8","8","8","1","3","3","8") 
$NewArray = New-Object System.Collections.ArrayList
$ArrayGroup = $OriginalArray | Group-Object | Select-Object Count,Name

ForEach ($EachNumber in $ArrayGroup) {
    $HalfTheCount = (1..([Math]::Round($EachNumber.Count / 2)))
    ForEach ($Item in $HalfTheCount) {$NewArray.Add($EachNumber.Name) | Out-Null}   
    } 
$NewArray

Version 2 (picking every other item from a sorted array - more efficient)

$OriginalArray = @("1","8","8","8","1","3","3","8") 

$NewArray = New-Object System.Collections.ArrayList

$OddOrEven = "Even"
ForEach ($SortedItem in ($OriginalArray | Sort-Object)) {
    If ($OddOrEven -eq "Even") {$NewArray.Add($SortedItem) | Out-Null; $EvenNumber = $True}
    If ($OddOrEven -eq "Odd") {$EvenNumber = $False}
    If ($EvenNumber -eq $True) {$OddOrEven = "Odd"} Else {$OddOrEven = "Even"} 
}
$NewArray
Andy Pyne
  • Your implementation is very inefficient. You are first grouping everything and counting the number per group, then you are 'generating' the correct number of elements via an intermediate array and then you are building a new array. Powershell is fine, just try to implement the same algorithm as the top answer suggest. – Dalibor Čarapić Jul 23 '20 at 09:09
  • Fair enough - I've just added a script based on sorting and selecting every other item. The reason I did it by grouping originally is that you have flexibility to roundup/down if you have elements with an odd number of entries. Also, it was just a quick bit of code ;) – Andy Pyne Jul 24 '20 at 11:09
-1
import itertools

lst = [1,8,8,8,1,3,3,8]
res = list(itertools.chain.from_iterable(itertools.repeat(x, lst.count(x)//2) for x in set(lst) if lst.count(x)%2==0))
print(res)

This builds the result from the distinct elements. For these small integers it happens to come out sorted, but set iteration order is not guaranteed, so don't rely on the result being sorted.

JUGAL KISHORE
-1

If the order matters, the following code works in O(n):

import collections

l = [1,8,8,8,1,3,3,8]
c = collections.Counter(l)    # total count of each value
c2 = collections.Counter()    # copies of each value kept so far
i, n = 0, len(l)
res=[]
for x in l:
    if i == n//2: break       # done once half the items have been kept
    if c2[x] < c[x] // 2:     # keep the first half of each value's occurrences
        res.append(x)
        c2[x] += 1
        i += 1
print(res)
# [1, 8, 8, 3]
Yang Liu
-2

Maybe this.

newList = []
for number in l:
    if newList.count(number) < l.count(number) / 2:
        newList.append(number)

print(newList)
faiz-e