Remove all duplicate values from a list of lists

Question

I want to remove all duplicate values from a list of lists.

So I have a list of lists like this.

a=[['102 min', '', 'Comedy', 'User Rating: 6.6/10 (4,072 user ratings)', '69', 'Metascore', '', 'Rank:', '10', 'Showtimes:', 'Studio Movie Grill - Downey', '11:00 am', '', '1:35 pm', '', '4:10', '', '4:55', '', '7:40', '', '9:55', '', '10:35'], ['110 min', '', 'Comedy', '', 'Drama', '', 'Romance', 'User Rating: 8.1/10 (11,478 user ratings)', '73', 'Metascore', '', 'Rank:', '18', 'Showtimes:', 'Studio Movie Grill - Downey', '11:30 am', '', '2:10 pm', '', '4:45', '', '7:25', '', '10:00'], ['111 min', '', 'Action', '', 'Adventure', '', 'SciFi', 'User Rating: 6/10 (23,905 user ratings)', '44', 'Metascore', '', 'Rank:', '7', 'Showtimes:', 'Studio Movie Grill - Downey', '11:05 am', '', '1:50 pm', '', '4:35', '', '7:20', '', '10:05'], ['118 min', '', 'Action', '', 'Adventure', '', 'Drama', '', 'Fantasy', '', 'Thriller', 'User Rating: 6.8/10 (45,126 user ratings)', '48', 'Metascore', '', 'Rank:', '8', 'Showtimes:', 'Studio Movie Grill - Downey', '11:10 am', '', '1:55 pm', '', '4:40', '', '7:35', '', '10:20'], ['120 min', '', 'Thriller', 'User Rating: 4.9/10 (1,002 user ratings)', '32', 'Metascore', '', 'Rank:', '16', 'Showtimes:', 'Studio Movie Grill - Downey', '11:20 am', '', '2:05 pm', '', '4:50', '', '7:45', '', '10:40'], ['134 min', '', 'Action', '', 'Adventure', '', 'SciFi', 'User Rating: 7.8/10 (223,161 user ratings)', '88', 'Metascore', '', 'Rank:', '4', 'Showtimes:', 'Studio Movie Grill - Downey', '12:00 pm', '', '4:05', '', '7:15', '', '10:15'], ['140 min', '', 'Action', '', 'Adventure', '', 'SciFi', 'User Rating: 7.9/10 (76,138 user ratings)', '64', 'Metascore', '', 'Rank:', '1', 'Showtimes:', 'Studio Movie Grill - Downey', '11:45 am', '', '4:00 pm', '', '7:10', '', '10:10'], ['86 min', '', 'Animation', '', 'Adventure', '', 'Comedy', '', 'Family', '', 'Fantasy', '', 'Mystery', '', 'Romance', 'User Rating: 4.7/10 (1,275 user ratings)', '36', 'Metascore', '', 'Rank:', '75', 'Showtimes:', 'Studio Movie Grill - Downey', '11:00 am', '', '1:15 pm', '', '3:30', '', 
'5:45', '', '7:55'], ['90 min', '', 'Drama', '', 'Horror', '', 'Thriller', 'User Rating: 8.2/10 (28,256 user ratings)', '82', 'Metascore', '', 'Rank:', '2', 'Showtimes:', 'Studio Movie Grill - Downey', '11:15 am', '', '12:05 pm', '', '1:40', '', '2:30', '', '4:15', '', '6:40', '', '7:30', '', '9:05', '', '10:15']]

I want to have:

unique = [['102 min',  'Comedy', 'User Rating: 6.6/10 (4,072 user ratings)', '69', 'Metascore',  'Rank:', '10', 'Showtimes:', 'Studio Movie Grill - Downey', '11:00 am',  '1:35 pm',  '4:10',  '4:55',  '7:40',  '9:55',  '10:35'], ['110 min',  'Comedy',  'Drama',  'Romance', 'User Rating: 8.1/10 (11,478 user ratings)', '73', 'Metascore',  'Rank:', '18', 'Showtimes:', 'Studio Movie Grill - Downey', '11:30 am',  '2:10 pm',  '4:45',  '7:25',  '10:00'], ['111 min',  'Action',  'Adventure',  'SciFi', 'User Rating: 6/10 (23,905 user ratings)', '44', 'Metascore',  'Rank:', '7', 'Showtimes:', 'Studio Movie Grill - Downey', '11:05 am',  '1:50 pm',  '4:35',  '7:20',  '10:05'], ['118 min',  'Action',  'Adventure',  'Drama',  'Fantasy',  'Thriller', 'User Rating: 6.8/10 (45,126 user ratings)', '48', 'Metascore',  'Rank:', '8', 'Showtimes:', 'Studio Movie Grill - Downey', '11:10 am',  '1:55 pm',  '4:40',  '7:35',  '10:20'], ['120 min',  'Thriller', 'User Rating: 4.9/10 (1,002 user ratings)', '32', 'Metascore',  'Rank:', '16', 'Showtimes:', 'Studio Movie Grill - Downey', '11:20 am',  '2:05 pm',  '4:50',  '7:45',  '10:40'], ['134 min',  'Action',  'Adventure',  'SciFi', 'User Rating: 7.8/10 (223,161 user ratings)', '88', 'Metascore',  'Rank:', '4', 'Showtimes:', 'Studio Movie Grill - Downey', '12:00 pm',  '4:05',  '7:15',  '10:15'], ['140 min',  'Action',  'Adventure',  'SciFi', 'User Rating: 7.9/10 (76,138 user ratings)', '64', 'Metascore',  'Rank:', '1', 'Showtimes:', 'Studio Movie Grill - Downey', '11:45 am',  '4:00 pm',  '7:10',  '10:10'], ['86 min',  'Animation',  'Adventure',  'Comedy',  'Family',  'Fantasy',  'Mystery',  'Romance', 'User Rating: 4.7/10 (1,275 user ratings)', '36', 'Metascore',  'Rank:', '75', 'Showtimes:', 'Studio Movie Grill - Downey', '11:00 am',  '1:15 pm',  '3:30',  '5:45',  '7:55'], ['90 min',  'Drama',  'Horror',  'Thriller', 'User Rating: 8.2/10 (28,256 user ratings)', '82', 'Metascore',  'Rank:', '2', 'Showtimes:', 'Studio Movie Grill - Downey', 
'11:15 am',  '12:05 pm',  '1:40',  '2:30',  '4:15',  '6:40',  '7:30',  '9:05',  '10:15']]

I don't know how to do this.

I tried the following:

unique = []
[unique.append(item) for item in a if item not in unique]

Thank you

Do you want duplicates removed just in the list they are in, or in the entire list of lists? — user3483203, Apr 10 '18 at 21:16
You might want to provide a more minimal example. It was not apparent which duplicates were being removed when looking through that very long list. — user3483203, Apr 10 '18 at 21:18
It's no different than removing duplicates from a single list, just iterate over the list items and apply the same method. — zwer, Apr 10 '18 at 21:18
Does the order matter? If not, you can use a set of tuples or a list of sets (I'm not sure which level you're trying to remove duplicates at) instead of a list of lists, and duplicates will automatically not exist. — abarnert, Apr 10 '18 at 21:18
@Zwer, it's different. I tried those methods, for example `set` and creating a new list with `append`, with no luck. — spider22, Apr 10 '18 at 21:20
Have you tried OrderedDict? https://docs.python.org/2/library/collections.html#ordereddict-objects — Matheus Mohr, Apr 10 '18 at 21:23
@MatheusMohr from my understanding, OrderedDict is for dictionaries, not for lists — spider22, Apr 10 '18 at 21:24
@Primusa: How will calling `sorted` help? That doesn't return the original order, it gives him a whole different one. (Plus, why waste time and space creating a list just to pass to `sorted`, when it can take a `set` just as easily?) — abarnert, Apr 10 '18 at 21:25
@spider22 You can use an `OrderedDict` with `None` for all the values as a quick&dirty ordered set—and an ordered set is the same thing as a list without duplicates. If you're going to use them more than once, though, you might want to consider using a complete `set`-like implementation like [Raymond Hettinger's recipe](https://github.com/ActiveState/code/blob/3b27230f418b714bc9a0f897cb8ea189c3515e99/recipes/Python/576696_OrderedSet_with_Weakrefs/README.md) or [the PyPI project `orderedset`](https://pypi.python.org/pypi/orderedset). — abarnert, Apr 10 '18 at 21:26
@abarnert exactly, as pointed out in the second answer here https://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set — Matheus Mohr, Apr 10 '18 at 21:29
@abarnert 1. His order seemed to be sorted, so I figured I might as well sort it. 2. I didn't know that; guess you learn something new every day — Primusa, Apr 10 '18 at 21:30
@MatheusMohr If I need something more fully set-like than just `OrderedDict((k, None) for …)`, I'd probably `pip install orderedset`, to get a fully tested and benchmarked library being used by other people in the field. But that definitely is a nice demonstration of how simple it is (especially now that almost nobody needs to worry about Python 2.6 or 3.1 anymore). — abarnert, Apr 10 '18 at 21:33
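The `OrderedDict` suggestion from the comments above can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `ordered_unique` and the sample rows are made up. It keeps the first occurrence of each value in every inner list and preserves the original order.

```python
from collections import OrderedDict

def ordered_unique(source):
    # Hypothetical helper: OrderedDict keys are unique and keep insertion
    # order, so the keys form an "ordered set" of the input values.
    return list(OrderedDict.fromkeys(source))

# Made-up sample data, much shorter than the list in the question.
rows = [['b', 'a', 'b', 'c', 'a'], ['x', '', 'y', '']]
print([ordered_unique(r) for r in rows])
# [['b', 'a', 'c'], ['x', '', 'y']]
```

Note that one empty string survives in the second row, because this approach keeps one copy of every duplicated value rather than dropping all copies.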

zwer · Answer 1 · 2018-04-10T23:49:04.483

It's easy to remove duplicates (keep only the unique elements) from a list: count your elements and keep only the ones that appear exactly once. You can use a temporary set to keep track of already counted elements to optimize it a bit:

test_list = ['102 min', '', 'Comedy', 'User Rating: 6.6/10 (4,072 user ratings)', '69',
             'Metascore', '', 'Rank:', '10', 'Showtimes:', 'Studio Movie Grill - Downey',
             '11:00 am', '', '1:35 pm', '', '4:10', '', '4:55', '', '7:40', '', '9:55', '',
             '10:35']

seen = set()  # a temp set for a quick duplicates lookup
unique_list = [e for e in test_list
               if e not in seen and not seen.add(e) and test_list.count(e) == 1]

# ['102 min', 'Comedy', 'User Rating: 6.6/10 (4,072 user ratings)', '69', 'Metascore',
#  'Rank:', '10', 'Showtimes:', 'Studio Movie Grill - Downey', '11:00 am', '1:35 pm', '4:10',
#  '4:55', '7:40', '9:55', '10:35']

Since you have a list of lists, the procedure is the same; you just need to apply it to each inner list. So move this into a function:

def get_unique(source):
    seen = set()  # a temp set for a quick duplicates lookup
    return [e for e in source
            if e not in seen and not seen.add(e) and source.count(e) == 1]

And then you can just iterate through your list `a` to get the uniques:

unique = [get_unique(e) for e in a]

If you want to strip just the duplicates (but keep at least one) all you need is to remove the source.count() check.
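As a minimal sketch of that variant, here is the same comprehension with the `count()` check dropped. The function name `get_first_occurrences` and the sample row are assumptions made for illustration only.

```python
def get_first_occurrences(source):
    seen = set()  # values we have already emitted
    # seen.add(e) returns None, so `not seen.add(e)` is always True;
    # it is only there to record e as a side effect.
    return [e for e in source if e not in seen and not seen.add(e)]

# Made-up sample row in the style of the question's data.
row = ['102 min', '', 'Comedy', '', 'Rank:', '', '10']
print(get_first_occurrences(row))
# ['102 min', '', 'Comedy', 'Rank:', '10']
```

Unlike `get_unique`, this keeps one empty string, so it does not reproduce the output requested in the question.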

Keep in mind, though, that this can get slow on very long lists, as we're counting (essentially iterating over the whole list) for each new element we encounter. Instead, you can build a counter dict first and then look the counts up directly:

import collections

def get_unique(source):
    counter = collections.defaultdict(int)  # our counter dict
    for e in source:
        counter[e] += 1
    return [e for e in source if counter[e] == 1]

The extra in-Python iteration will pay off quickly for longer lists.
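The same counting idea can also be written with `collections.Counter` from the standard library, which builds the counts in one call. This is a sketch only; the name `get_unique_counter` and the sample rows are invented for the example.

```python
from collections import Counter

def get_unique_counter(source):
    counts = Counter(source)  # one pass to count every element
    # Keep only the elements that occur exactly once, in original order.
    return [e for e in source if counts[e] == 1]

# Made-up sample data.
rows = [['a', '', 'b', '', 'c'], ['x', 'y', 'x']]
print([get_unique_counter(r) for r in rows])
# [['a', 'b', 'c'], ['y']]
```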

Why would you need a `set` in the second example with `defaultdict`? Shouldn't `return [e for e in source if counter[e] == 1]` be enough? — radzak, Apr 10 '18 at 22:07
great answer, +1! This should be the accepted one. The time complexity of the first code is `O(n^2)` in the worst case and `O(n)` in case of the second example with `defaultdict`, right? — radzak, Apr 11 '18 at 11:01
@Jatimir - The former is indeed `O(n²)`, the latter is `O(2*n)` (well, canonically speaking, `O(n)`). But the algorithm complexity is not all there is - the execution speed will depend on the underlying interpreter. For example, in CPython the first one (provided you alias `seen.add` and `source.count`) should end up running faster for shorter lists as all of the iterations and comparisons run on the _fast_ C side with no context switch with the _Python realm_, while the other uses a considerable amount of time on context switching during the set up and counting (but then runs blazingly fast). — zwer, Apr 11 '18 at 11:37

score -1 · Accepted Answer · answered Apr 10 '18 at 21:35

Here is what you are looking for.

You actually want to remove all the empty entries from the lists:

unique =[]
tempList = []
a=[['102 min', '', 'Comedy', 'User Rating: 6.6/10 (4,072 user ratings)', '69', 'Metascore', '', 'Rank:', '10', 'Showtimes:', 'Studio Movie Grill - Downey', '11:00 am', '', '1:35 pm', '', '4:10', '', '4:55', '', '7:40', '', '9:55', '', '10:35'], ['110 min', '', 'Comedy', '', 'Drama', '', 'Romance', 'User Rating: 8.1/10 (11,478 user ratings)', '73', 'Metascore', '', 'Rank:', '18', 'Showtimes:', 'Studio Movie Grill - Downey', '11:30 am', '', '2:10 pm', '', '4:45', '', '7:25', '', '10:00'], ['111 min', '', 'Action', '', 'Adventure', '', 'SciFi', 'User Rating: 6/10 (23,905 user ratings)', '44', 'Metascore', '', 'Rank:', '7', 'Showtimes:', 'Studio Movie Grill - Downey', '11:05 am', '', '1:50 pm', '', '4:35', '', '7:20', '', '10:05'], ['118 min', '', 'Action', '', 'Adventure', '', 'Drama', '', 'Fantasy', '', 'Thriller', 'User Rating: 6.8/10 (45,126 user ratings)', '48', 'Metascore', '', 'Rank:', '8', 'Showtimes:', 'Studio Movie Grill - Downey', '11:10 am', '', '1:55 pm', '', '4:40', '', '7:35', '', '10:20'], ['120 min', '', 'Thriller', 'User Rating: 4.9/10 (1,002 user ratings)', '32', 'Metascore', '', 'Rank:', '16', 'Showtimes:', 'Studio Movie Grill - Downey', '11:20 am', '', '2:05 pm', '', '4:50', '', '7:45', '', '10:40'], ['134 min', '', 'Action', '', 'Adventure', '', 'SciFi', 'User Rating: 7.8/10 (223,161 user ratings)', '88', 'Metascore', '', 'Rank:', '4', 'Showtimes:', 'Studio Movie Grill - Downey', '12:00 pm', '', '4:05', '', '7:15', '', '10:15'], ['140 min', '', 'Action', '', 'Adventure', '', 'SciFi', 'User Rating: 7.9/10 (76,138 user ratings)', '64', 'Metascore', '', 'Rank:', '1', 'Showtimes:', 'Studio Movie Grill - Downey', '11:45 am', '', '4:00 pm', '', '7:10', '', '10:10'], ['86 min', '', 'Animation', '', 'Adventure', '', 'Comedy', '', 'Family', '', 'Fantasy', '', 'Mystery', '', 'Romance', 'User Rating: 4.7/10 (1,275 user ratings)', '36', 'Metascore', '', 'Rank:', '75', 'Showtimes:', 'Studio Movie Grill - Downey', '11:00 am', '', '1:15 pm', '', '3:30', '', 
'5:45', '', '7:55'], ['90 min', '', 'Drama', '', 'Horror', '', 'Thriller', 'User Rating: 8.2/10 (28,256 user ratings)', '82', 'Metascore', '', 'Rank:', '2', 'Showtimes:', 'Studio Movie Grill - Downey', '11:15 am', '', '12:05 pm', '', '1:40', '', '2:30', '', '4:15', '', '6:40', '', '7:30', '', '9:05', '', '10:15']]
for mylist in a:
    # collect the non-empty entries of this inner list
    for elems in mylist:
        if elems != '':
            tempList.append(elems)
    unique.append(tempList)
    tempList = []  # start a fresh list for the next inner list


print(unique)

Desired output:

[['102 min', 'Comedy', 'User Rating: 6.6/10 (4,072 user ratings)', '69', 'Metascore', 'Rank:', '10', 'Showtimes:', 'Studio Movie Grill - Downey', '11:00 am', '1:35 pm', '4:10', '4:55', '7:40', '9:55', '10:35'], ['110 min', 'Comedy', 'Drama', 'Romance', 'User Rating: 8.1/10 (11,478 user ratings)', '73', 'Metascore', 'Rank:', '18', 'Showtimes:', 'Studio Movie Grill - Downey', '11:30 am', '2:10 pm', '4:45', '7:25', '10:00'], ['111 min', 'Action', 'Adventure', 'SciFi', 'User Rating: 6/10 (23,905 user ratings)', '44', 'Metascore', 'Rank:', '7', 'Showtimes:', 'Studio Movie Grill - Downey', '11:05 am', '1:50 pm', '4:35', '7:20', '10:05'], ['118 min', 'Action', 'Adventure', 'Drama', 'Fantasy', 'Thriller', 'User Rating: 6.8/10 (45,126 user ratings)', '48', 'Metascore', 'Rank:', '8', 'Showtimes:', 'Studio Movie Grill - Downey', '11:10 am', '1:55 pm', '4:40', '7:35', '10:20'], ['120 min', 'Thriller', 'User Rating: 4.9/10 (1,002 user ratings)', '32', 'Metascore', 'Rank:', '16', 'Showtimes:', 'Studio Movie Grill - Downey', '11:20 am', '2:05 pm', '4:50', '7:45', '10:40'], ['134 min', 'Action', 'Adventure', 'SciFi', 'User Rating: 7.8/10 (223,161 user ratings)', '88', 'Metascore', 'Rank:', '4', 'Showtimes:', 'Studio Movie Grill - Downey', '12:00 pm', '4:05', '7:15', '10:15'], ['140 min', 'Action', 'Adventure', 'SciFi', 'User Rating: 7.9/10 (76,138 user ratings)', '64', 'Metascore', 'Rank:', '1', 'Showtimes:', 'Studio Movie Grill - Downey', '11:45 am', '4:00 pm', '7:10', '10:10'], ['86 min', 'Animation', 'Adventure', 'Comedy', 'Family', 'Fantasy', 'Mystery', 'Romance', 'User Rating: 4.7/10 (1,275 user ratings)', '36', 'Metascore', 'Rank:', '75', 'Showtimes:', 'Studio Movie Grill - Downey', '11:00 am', '1:15 pm', '3:30', '5:45', '7:55'], ['90 min', 'Drama', 'Horror', 'Thriller', 'User Rating: 8.2/10 (28,256 user ratings)', '82', 'Metascore', 'Rank:', '2', 'Showtimes:', 'Studio Movie Grill - Downey', '11:15 am', '12:05 pm', '1:40', '2:30', '4:15', '6:40', '7:30', '9:05', '10:15']]
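For comparison, the same empty-string filtering could be written as a nested list comprehension. This is a sketch under the assumption that only `''` entries should be dropped; the helper name `drop_empty` and the short sample are illustrative.

```python
def drop_empty(rows):
    # Keep every non-empty string, preserving order within each inner list.
    return [[item for item in row if item != ''] for row in rows]

# Made-up sample, much shorter than the list in the question.
sample = [['102 min', '', 'Comedy', '', '4:10'], ['110 min', '', 'Drama']]
print(drop_empty(sample))
# [['102 min', 'Comedy', '4:10'], ['110 min', 'Drama']]
```

Like the loop above, this removes only empty entries; any other repeated values are left in place.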

This is removing only empty elements, not necessarily duplicates. — zwer, Apr 10 '18 at 21:36
I agree, but I don't see any duplicates in the list of lists; the only thing that was duplicated is `''`. The output is exactly what OP is asking for. — toheedNiaz, Apr 10 '18 at 21:38
Did the trick for me... but it's correct that it only removes certain elements — spider22, Apr 10 '18 at 21:38
@toheedNiaz Yeah, sure... If the output of the example code in the question happens to be 'hello world'. Let's just submit an answer with `print('hello world')` no matter what the question was... — radzak, Apr 10 '18 at 21:47
@Jatimir really ? hello world ? Did the trick for me... means something in the context. — toheedNiaz, Apr 10 '18 at 21:52
@toheedNiaz yeah, but he's not the only one who may benefit from this question in the future and your answer does not solve the problem stated in the question. This being an accepted answer will rather mislead than help someone. What I meant with 'hello world' is that producing the expected output of the example code doesn't mean it solves the given problem. — radzak, Apr 10 '18 at 21:57
