Remove duplicates while merging lists of lists into single list in python

Question

So, I have lists of lists like following:

data = [
['foo', 'bar'],
['one', 'two']
]

And, I want to flatten these lists by alternating between two lists. So, output like

flattened = ['foo', 'one', 'bar', 'two']

I am using the list(chain.from_iterable(zip_longest(*data))) which works fine.

But, I am trying to figure out how to handle scenarios where there are duplicates that I want to get rid of.

data = [
['foo', 'bar'],
['foo', 'two']
]

I want something like

flatted = ['foo', 'two', 'bar']

rather than ['foo', 'foo', 'bar', 'two']

How do I do this?

@Marat, well if they want to alternate between lists, then inherently yes. — Brian, Sep 23 '19 at 19:21
I don't have time at the moment but you can implement an [ordereddict](https://docs.python.org/3/library/collections.html#collections.OrderedDict) from your output list and then convert back or dict in 3.7+ — MyNameIsCaleb, Sep 23 '19 at 19:33
The correct output is `['foo', 'bar', 'bar'] `or `['foo', 'two', 'bar']`? — Massifox, Sep 23 '19 at 19:46

Alexander · Accepted Answer · 2019-09-23T20:07:51.387

4

Use a set to keep track of what you've already seen, which is an O(1) membership test.

result = []
seen = set()
for item in chain.from_iterable(zip_longest(*data)):
    if item not in seen:
        seen.add(item)
        result.append(item)
>>> result
['foo', 'bar', 'two']

Note that this question talks about removing duplicates from a list: Removing duplicates in lists

TL;DR

For Python 3.7+ (or Cython 3.6+):

>>> list(dict.fromkeys(chain.from_iterable(zip_longest(*data))))
['foo', 'bar', 'two']

edited Sep 23 '19 at 20:07

answered Sep 23 '19 at 19:39

Alexander

105,104
32
201
196

OP is requesting `['foo', 'two', 'bar'] `, not `['foo', 'bar', 'two'] ` – Brian Sep 23 '19 at 19:40
"something like" `['foo', 'two', 'bar']` – Alexander Sep 23 '19 at 19:41
yes I thought that initially as well but technically only 1 of those lists satisfies the values "alternating between two lists" as OP specified earlier. It's ambiguous at the end of the day so who knows. – Brian Sep 23 '19 at 19:43

Massifox · Answer 2 · 2019-09-23T22:21:00.520

0

Try this code:

list(dict.fromkeys(sum(data, [])))

EDIT: As pointed out in the comments, the sum is not the most efficient method to flatten a list, you can use itertools.chain.from_iterable to get the flattened list, then do the following:

list(dict.fromkeys(chain.from_iterable(data)))

In both cases the output is the following:

['foo', 'bar', 'two']

Comparison of execution times

Below they propose a comparison of the execution times of the main proposed solutions:

Benchmark data1:

 data1 = [['foo', 'bar'],['foo', 'two']] * 1000000

@Massifox's solution with `itertools.chain.from_iterable`

%%timeit
list(chain.from_iterable(data1))
# 128 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Marat's solution with `dedup`

%%timeit
list(dedup(chain.from_iterable(zip_longest(*data1))))
# 579 ms ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Alexander's solution with `zip_longest`

%%timeit
list(dict.fromkeys(chain.from_iterable(zip_longest(*data1))))
# 456 ms ± 149 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark data2:

x = 10
y = 500000
n_max = 1000
data2 = [[np.random.randint(1, n_max) for _ in range(0, x)] for _ in range(0, y)]

@Massifox's solution with `itertools.chain.from_iterable`

%%timeit
list(chain.from_iterable(data2))
# 241 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Marat's solution with `dedup`

%%timeit
list(dedup(chain.from_iterable(zip_longest(*data2))))
# 706 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Alexander's solution with `zip_longest`

%%timeit
list(dict.fromkeys(chain.from_iterable(zip_longest(*data2))))
# 674 ms ± 56.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The implementations based on sum and on Counter are decisively less efficient and take tens of seconds already with instances of smaller benchmark of [['foo', 'bar'],['foo', 'two']] * 100k.

With benchmark data1, the solution based on itertools.chain.from_iterable proposed by me seems to be about 4-5 times faster than the others.

With benchmark data2, the solution based on itertools.chain.from_iterable proposed by me seems to be about 2-3 times faster than the others.

edited Sep 23 '19 at 22:21

answered Sep 23 '19 at 19:35

Massifox

4,369
11
31

This is pretty good but `sum` is a terribly slow way to flatten a list. – miike3459 Sep 23 '19 at 19:35
This produces "bar" before it produces "two", OP is requesting that the opposite happens. – Brian Sep 23 '19 at 19:38
@BrianJoseph Probably the author of the post was wrong to expect that output, I'm asking for confirmation in the comments – Massifox Sep 23 '19 at 19:48
These timings tests are _so_ wrong... You have four million data points, but only three of which are unique. Furthermore, your timed solution gives the wrong result. – Alexander Sep 23 '19 at 21:16
are correct on my machine on that benchmark instance. If you have another instance to offer, write it in the comments that the text and add it to the comparison. Thanks – Massifox Sep 23 '19 at 21:22
@Alexander I updated my answer and added another benchmark instance with different features. I added the results. Let me know what you think. Thanks – Massifox Sep 23 '19 at 22:08
Better representation of the data, but your answer still gives the wrong result because it contains duplicates which still need to be removed. – Alexander Sep 23 '19 at 22:11
@Alexander, I updated the execution times of your solution, by mistake I still used the first instance instead of the second one. Why do you say the results are different? sorted (your_sol) == sorted (my_sol) gives True. – Massifox Sep 23 '19 at 22:25
Per your `timeit` function above, your result gives the wrong answer. `>>> list(chain.from_iterable(data)) # ['foo', 'bar', 'foo', 'two']` – Alexander Sep 23 '19 at 23:07

score 0 · Answer 3 · answered Sep 23 '19 at 19:36

Hmm, this might be a bit more overhead than you're looking for but this should work and guarantees list-wise order unlike sets:

from itertools import cycle
from collections import Counter

output = []
checker = Counter()

for lst in cycle(data):
    if not data:
        break

    while lst:
        item = lst.pop(0)
        if not checker[item]:
            output.append(item)
            checker[item] += 1
            break            

    if not lst:
        data.remove(lst)
        continue

output:

['foo', 'two', 'bar']

juancarlos · Answer 4 · 2019-09-23T19:33:37.473

-3

you can create a set and then convert to list again some like this:

l1 = ['foo', 'foo', 'bar', 'two']
l2 = list(set(l1))

it's create a second list without repeated items

if you want keept the order you can do this

ordered = dict.fromkeys(l1).keys()

edited Sep 23 '19 at 19:33

answered Sep 23 '19 at 19:25

juancarlos

593
3
9

Will this always maintain order? – MyNameIsCaleb Sep 23 '19 at 19:27
@MyNameIsCaleb Sets are unordered but you can probs find a OrderedSet datatype somewhere. – Joshua Nixon Sep 23 '19 at 19:28
Agree @JoshuaNixon which was my point. Dict does in 3.7+ based on insertion order. – MyNameIsCaleb Sep 23 '19 at 19:34
so, you can use same example using ordereddict instance of dict – juancarlos Sep 23 '19 at 19:47

Remove duplicates while merging lists of lists into single list in python

4 Answers4

Comparison of execution times

@Massifox's solution with itertools.chain.from_iterable

@Marat's solution with dedup

@Alexander's solution with zip_longest

@Massifox's solution with itertools.chain.from_iterable

@Marat's solution with dedup

@Alexander's solution with zip_longest

@Massifox's solution with `itertools.chain.from_iterable`

@Marat's solution with `dedup`

@Alexander's solution with `zip_longest`

@Massifox's solution with `itertools.chain.from_iterable`

@Marat's solution with `dedup`

@Alexander's solution with `zip_longest`