Removing duplicate content from a list of lists while not preserving any order

Question

I was writing a code in python to find factor pairs for an integer. But making pairs resulted in reverse pairs as well. I want to eliminate those reverse pairs using a simple method without importing any modules. for eg.

[[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20], [20, 10], [25, 8], [40, 5], [50, 4], [100, 2], [200, 1]]

the output should be:

[[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20]]

This is what I've got so far:

N = []
J = []
F = []
Z = []
S = []
num = input("Enter no. of elements in list")
print ('Enter numbers')
prod = 1
for i in range(int(num)):
    n = input("num :")
    N.append(int(n))
for x in N:
    prod = prod*x
print (prod)
k = input("Enter no. of splits:")
for o in range(1,prod+1):
    if prod%o == 0:
        J.append(o)
        F.append(o)
print (J)

Z = [[a, b] for a in J for b in F if a*b == prod]
print (Z)

Why don't you make sure that your algorithm does not produce the reverse pairs in the first place, by e.g. making sure that you only accept `[a, b]` when `a < b`? — CompuChip, Jul 22 '18 at 05:30
Why not iterate up to the square root of num instead of generating more numbers than necessary and then filtering? Seems wasteful to me — nadavvadan, Jul 22 '18 at 07:45

score 7 · Answer 1 · answered Jul 21 '18 at 14:07

7

Using set to remove duplicates.

Ex:

lst = [[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20], [20, 10], [25, 8], [40, 5], [50, 4], [100, 2], [200, 1]]
lst = set([tuple(sorted(i)) for i in lst])      #Sort inner list and then use set
lst = list(map(list, lst))                      #Converting back to list
print(lst)

Output:

[[8, 25], [4, 50], [1, 200], [10, 20], [2, 100], [5, 40]]

answered Jul 21 '18 at 14:07

Rakesh

81,458
17
76
113

are there any other ways to achieve this?? – Ritik Samaiya Jul 21 '18 at 14:13
2

I am not sure what you mean? – Rakesh Jul 21 '18 at 14:21

Håken Lid · Answer 2 · 2018-07-21T18:35:37.240

If the input is large, there is a significant performance advantage by using sets instead of lists.

>>> unique = set(map(frozenset, pairs))
>>> unique
{frozenset({1, 200}),
 frozenset({10, 20}),
 frozenset({5, 40}),
 frozenset({2, 100}),
 frozenset({8, 25}),
 frozenset({4, 50})}

The inner sets must be frozenset because regular sets are mutable, and sets can only contain immutable children.

To convert back to a list of lists.

>>> list(map(list, unique))
[[200, 1], [10, 20], [40, 5], [2, 100], [8, 25], [50, 4]]

Sets are iterable, so depending on your usage, this step might not be needed.

Here's a function that does both steps and returns the result as a nested list.

def unique_pairs(pairs): 
    return list(map(list,set(map(frozenset, pairs))))

Note that converting a list into a set will convert a list containing an identical pairs (for example [20,20]) to a single element set ({20}). So if your input can contain identical pairs, you might want to do an extra final step to expand singletons back to pairs.

def unique_pairs(pairs):
    return [(2*[*p])[:2] for p in set(map(frozenset,pairs))]

This will work with bot twin pairs and mixed pairs.

>>> pairs = [[10, 2], [2, 10], [10, 10], [2, 2]]
>>> unique_pairs(pairs)
[[10, 10], [2, 2], [10, 2]]

RoadRunner · Answer 3 · 2018-07-21T14:27:47.667

You can keep a set to keep track of what has been seen, and use a frozenset() to hash the lists into the seen set:

seen = set()
result = []
for sublst in lst:
    curr = frozenset(sublst)
    if curr not in seen:
        seen.add(curr)
        result.append(sublst)

print(result)

Which Outputs:

[[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20]]

If you later want to use libraries, you could use a collections.OrderedDict():

lst = [[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20], [20, 10], [25, 8], [40, 5], [50, 4], [100, 2], [200, 1]]

d = OrderedDict()
for sublist in lst:
    d.setdefault(frozenset(sublist), sublist)

print(list(d.values()))
# [[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20]]

Sunitha · Accepted Answer · 2018-07-21T17:08:57.987

4

>>> l = [[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20], [20, 10], [25, 8], [40, 5], [50, 4], [100, 2], [200, 1]]
>>> new_l = []
>>> for e in l:
...     if e not in new_l and sorted(e) not in new_l:
...         new_l.append(e)
... 
>>> new_l
[[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20]]
>>>

edited Jul 21 '18 at 17:08

answered Jul 21 '18 at 14:38

Sunitha

11,777
2
20
23

list reversal will not work for triple element lists. For eg. [1,2,1] as a case of [2,1,1]/[1,1,2]. – Ritik Samaiya Jul 21 '18 at 16:55
1

Updated the answer to use `sorted` instead of `reversed`. Can you check if this fixes your issue – Sunitha Jul 21 '18 at 17:09
This time it worked well. Last time I tried, it just gave the same list. But still, I am having a hard time figuring out how this line works.and why did it didn't work for `Sorted(Z)` and instead worked out for list(sorted(e)) – Ritik Samaiya Jul 21 '18 at 17:21
5

If performance matters, this solution is not ideal. O(n^2) compared to O(n) for solutions using a set instead of a list. If your input is small, this might not be noticeable, but for large inputs this can be very slow. – Håken Lid Jul 21 '18 at 18:41

score 2 · Answer 5 · answered Jul 21 '18 at 14:49

2

myList = [[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20], [20, 10], [25, 8], [40, 5], [50, 4], [100, 2], [200, 1]]
newList = myList.copy()
for i, j in newList:
    newList.remove([j,i])

print (newList)
#[[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20]]

answered Jul 21 '18 at 14:49

Transhuman

3,527
1
9
15

1

Modifying the thing you're iterating over is the stuff of my nightmares. – user1717828 Jul 22 '18 at 02:16

score 2 · Answer 6 · answered Jul 21 '18 at 14:50

You can use toolz.unique to maintain ordering at the outer level. If you don't have access to the 3rd party toolz library, you can use the unique_everseen recipe from the official docs.

L = [[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20],
     [20, 10], [25, 8], [40, 5], [50, 4], [100, 2], [200, 1]]

from toolz import unique

res = list(unique(map(tuple, map(sorted, L))))

print(res)

[(1, 200), (2, 100), (4, 50),
 (5, 40), (8, 25), (10, 20)]

Tuple conversion is required since unique uses hashing; and tuples are hashable while lists are not. If it's important you have a list of lists, you can apply an extra conversion:

res = list(map(list, unique(map(tuple, map(sorted, L)))))

At this stage, it's not particularly readable, so I suggest you split into a few steps:

sorter = map(sorted, L)
uniquify = unique(map(tuple, sorter))
res = list(map(list, uniquify))

Marcus.Aurelianus · Answer 7 · 2018-07-22T13:14:47.000

2

I am expecting downvote because my answer seems abit off topic.

In the first place, you only have to check value up to int((prod+1)**0.5)+1 which ensures no duplicates.

N = []
J = []
F = []
Z = []
S = []
num = input("Enter no. of elements in list: ")
print ('Enter numbers')
prod = 1
for i in range(int(num)):
    n = input("num :")
    N.append(int(n))
for x in N:
    prod = prod*x
print (prod)
k = input("Enter no. of splits:")
for o in range(1,int(prod**0.5)+1):
    if prod%o == 0:
        Z.append([o,prod//o])
print (Z)

Result:

Enter no. of elements in list: 1
Enter numbers
num :200
Enter no. of splits:0
[1, 2, 4, 5, 8, 10]
[[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20]]

edited Jul 22 '18 at 13:14

answered Jul 21 '18 at 15:00

Marcus.Aurelianus

1,520
10
22

Instead of the "o not in F" check, you could check for only divisibility and iterate values of o only up to the square root of prod rounded up + 1 (since any factors greater than the square root of the product will necessarily have been appended to F). This saves both loop iterations and repeated searches within the values in F (both of which could cause this code to run slowly for large values of prod). – redyoshi49q Jul 22 '18 at 05:27
Your edit throws an error; you can't use a float value as a bound for a range statement (I tested using http://www.pythontutor.com/visualize.html set to emulate Python 3.6). Using range(1,int(round(prod**0.5))+1) works as expected. – redyoshi49q Jul 22 '18 at 06:20
@redyoshi49q,ya, updated, forget the bound should be an integer. – Marcus.Aurelianus Jul 22 '18 at 06:21
Also note that Using range(1,int(round((prod+1)**0.5))) will /not/ work for edge cases. For instance, when prod=36, we want to iterate over the elements 1 through 6 inclusive; however, prod+1 == 37, and the square root of that rounds to 6, which range will /exclude/. – redyoshi49q Jul 22 '18 at 06:22
@redyoshi49q,, that is right, sir. Changed to int(prod**0.5)+1. – Marcus.Aurelianus Jul 22 '18 at 06:24

score 2 · Answer 8 · answered Jul 22 '18 at 11:40

It's actually quite tricky to get that right in a general way.

It essentially boils down to two basic problems:

Checking if two lists contain the same elements
Remove all lists that contain the same elements

I'll tackle these separately.

Check if two lists contain the same elements

I best to refer to Raymond Hettingers answer from here:

O(n): The Counter() method is best (if your objects are hashable):
from collections import Counter
def compare(s, t):
    return Counter(s) == Counter(t)
O(n log n): The sorted() method is next best (if your objects are orderable):
def compare(s, t):
    return sorted(s) == sorted(t)
O(n * n): If the objects are neither hashable, nor orderable, you can use equality:
def compare(s, t):
    t = list(t)   # make a mutable copy
    try:
        for elem in s:
            t.remove(elem)
    except ValueError:
        return False
    return not t

In your case you don't want any imports so you could replace collections.Counter with:

def count(it):
    d = {}
    for item in it:
        try:
            d[item] += 1
        except KeyError:
            d[item] = 1
    return d

Just in case the items are hashable and you don't care about the count of the items (for example [1,1,2] should be interpreted as equal to [1,2,2]) or they will always be unique then you could also use sets:

def compare(s, t):
    return set(s) == set(t)

So that way you can check if two sublists contain the same elements. There are possible optimizations in case you might have lists of different lengths, then it could be worthwhile to add a:

if len(s) != len(t):
    return False

At the beginning of each of these functions.

Removing duplicates from the list

That also depends on assumptions about the result (should the non-duplicates keep their relative order or not) and the contents (again, can you hash the contents or can they be ordered).

If the items are hashable (or could be converted to something hashable) you could use a set call to remove duplicates. If you care about the order you can still use a set but only for lookups, for example the recipe from the itertools documentation unique_everseen:

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

You mentioned no imports but fortunately we don't need the key is None part anyway (see below) so you can simple use:

def unique_everseen(iterable, key):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        k = key(element)
        if k not in seen:
            seen_add(k)
            yield element

Note that the approaches to compare the inner lists use sets, dictionaries and lists which are unhashable. But all of them can be converted to a hashable collections, like frozensets or tuples:

# for sets
frozenset(s)

# for dictionaries
frozenset(d.items())

# for lists
tuple(l)

However the last approach (if the items are unhashable and cannot be ordered) can't be used with this approach so let's ignore it for now.

Basically you could then use unique_everseen like this:

list(unique_everseen(your_list, key=lambda sublist: frozenset(count(sublist).items())))
# Or with collections.Counter instead of count

Or if you don't care about the duplicates (or there will be no duplicates) inside your sublists:

list(unique_everseen(your_list, key=frozenset))

Or if they are not hashable but can be ordered:

list(unique_everseen(your_list, key=lambda sublist: tuple(sorted(sublist))))

Just the approach in case the items in your sublist are not hashable and not orderable cannot be done using that fast unique_everseen approach. You'll have to use a slower variant:

def compare(s, t):
    t = list(t)   # make a mutable copy
    try:
        for elem in s:
            t.remove(elem)
    except ValueError:
        return False
    return not t

def unique_everseen_slow(iterable):
    seen = []
    for element in iterable:
        for already_seen_item in seen:
            if compare(element, already_seen_item):
                break  # We found a match, so stop looking
        else:
            seen.append(element)
            yield element

list(unique_everseen_slow(your_list))

The else clause belongs to the for loop and is only entered when there was no break. You could instead also check for any to avoid this for-else:

def unique_everseen_slow(iterable):
    seen = []
    for element in iterable:
        if not any(compare(element, seen_element) for seen_element in seen):
            seen.append(element)
            yield element

In your case it's actually very easy because the integers in the sublists are hashable and orderable. But this can become very complex for more general cases.

However in your case you could even avoid creating duplicate factors by simply checking that the factors are sorted (and if not stop):

def factors(number):
    for candidate in range(1, number + 1):
        if number % candidate == 0:
            other_factor = number // candidate
            if candidate > other_factor:
                return
            yield [candidate, other_factor]

For example:

>>> list(factors(200))
[[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20]]

score 1 · Answer 9 · answered Jul 21 '18 at 22:21

1

I think in this case you can use some domain knowledge here. Specifically, you will get the pairs [x, y] (I recommend you make this a tuple (specifically a 2-tuple, otherwise known as a pair) not an array), and [y,x] except when x=y where you get it only once. So you can write a simple function:

def unique_factors(all_factors):
  [ [a,b] for [a,b] in all_factors if a <= b ]

answered Jul 21 '18 at 22:21

Dan Robertson

4,315
12
17

What if I don't have pairs and have triplets and quadruples instead – Ritik Samaiya Jul 22 '18 at 07:52
1

If we’re still talking about factorisations then I’d map over your list of factorisations, sorting each one and then make a set out of the whole thing. – Dan Robertson Jul 22 '18 at 07:54

score 0 · Answer 10 · answered Jul 24 '18 at 16:11

lst = [[1, 200], [2, 100], [4, 50], [5, 40], [8, 25], [10, 20], [20, 10], [25, 8], [40, 5], [50, 4], [100, 2], [200, 1]]

[list(i) for i in set([tuple(sorted(i)) for i in l]))]

Answer:

[(8, 25), (4, 50), (1, 200), (10, 20), (2, 100), (5, 40)]

Explanation:

First we need to sort each list so that we can make duplicate list look similar. Then we need to convert each list into tuple so that we can use set() to eliminate duplicates.

We can use use List as it is since elements of list has to be hashable to use set().

set([tuple(sorted(i)) for i in l])

this gives us the set of all the elements with out duplicates. But its a set and each element is a tuple but they should be lists.

we can use list comprehension to convert the tuple elements into lists.

Removing duplicate content from a list of lists while not preserving any order

10 Answers10

Check if two lists contain the same elements

Removing duplicates from the list