1

So, I have lists of lists like following:

data = [
['foo', 'bar'],
['one', 'two']
]

And, I want to flatten these lists by alternating between two lists. So, output like

flattened = ['foo', 'one', 'bar', 'two']

I am using the list(chain.from_iterable(zip_longest(*data))) which works fine.

But, I am trying to figure out how to handle scenarios where there are duplicates that I want to get rid of.

data = [
['foo', 'bar'],
['foo', 'two']
]

I want something like

flatted = ['foo', 'two', 'bar'] 

rather than ['foo', 'foo', 'bar', 'two']

How do I do this?

frazman
  • 32,081
  • 75
  • 184
  • 269

4 Answers4

4

Use a set to keep track of what you've already seen, which is an O(1) membership test.

result = []
seen = set()
for item in chain.from_iterable(zip_longest(*data)):
    if item not in seen:
        seen.add(item)
        result.append(item)
>>> result
['foo', 'bar', 'two']

Note that this question talks about removing duplicates from a list: Removing duplicates in lists

TL;DR

For Python 3.7+ (or Cython 3.6+):

>>> list(dict.fromkeys(chain.from_iterable(zip_longest(*data))))
['foo', 'bar', 'two']
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • OP is requesting `['foo', 'two', 'bar'] `, not `['foo', 'bar', 'two'] ` – Brian Sep 23 '19 at 19:40
  • "something like" `['foo', 'two', 'bar']` – Alexander Sep 23 '19 at 19:41
  • yes I thought that initially as well but technically only 1 of those lists satisfies the values "alternating between two lists" as OP specified earlier. It's ambiguous at the end of the day so who knows. – Brian Sep 23 '19 at 19:43
0

Try this code:

list(dict.fromkeys(sum(data, [])))

EDIT: As pointed out in the comments, the sum is not the most efficient method to flatten a list, you can use itertools.chain.from_iterable to get the flattened list, then do the following:

list(dict.fromkeys(chain.from_iterable(data)))

In both cases the output is the following:

['foo', 'bar', 'two']

Comparison of execution times

Below they propose a comparison of the execution times of the main proposed solutions:

  1. Benchmark data1:

     data1 = [['foo', 'bar'],['foo', 'two']] * 1000000
    
    1. @Massifox's solution with itertools.chain.from_iterable

      %%timeit
      list(chain.from_iterable(data1))
      # 128 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
      
    2. @Marat's solution with dedup

      %%timeit
      list(dedup(chain.from_iterable(zip_longest(*data1))))
      # 579 ms ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
      
    3. @Alexander's solution with zip_longest

      %%timeit
      list(dict.fromkeys(chain.from_iterable(zip_longest(*data1))))
      # 456 ms ± 149 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
      
  2. Benchmark data2:

    x = 10
    y = 500000
    n_max = 1000
    data2 = [[np.random.randint(1, n_max) for _ in range(0, x)] for _ in range(0, y)]
    
    1. @Massifox's solution with itertools.chain.from_iterable

      %%timeit
      list(chain.from_iterable(data2))
      # 241 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
      
    2. @Marat's solution with dedup

      %%timeit
      list(dedup(chain.from_iterable(zip_longest(*data2))))
      # 706 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
      
    3. @Alexander's solution with zip_longest

      %%timeit
      list(dict.fromkeys(chain.from_iterable(zip_longest(*data2))))
      # 674 ms ± 56.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
      

The implementations based on sum and on Counter are decisively less efficient and take tens of seconds already with instances of smaller benchmark of [['foo', 'bar'],['foo', 'two']] * 100k.

With benchmark data1, the solution based on itertools.chain.from_iterable proposed by me seems to be about 4-5 times faster than the others.

With benchmark data2, the solution based on itertools.chain.from_iterable proposed by me seems to be about 2-3 times faster than the others.

Massifox
  • 4,369
  • 11
  • 31
  • This is pretty good but `sum` is a terribly slow way to flatten a list. – miike3459 Sep 23 '19 at 19:35
  • This produces "bar" before it produces "two", OP is requesting that the opposite happens. – Brian Sep 23 '19 at 19:38
  • @BrianJoseph Probably the author of the post was wrong to expect that output, I'm asking for confirmation in the comments – Massifox Sep 23 '19 at 19:48
  • These timings tests are _so_ wrong... You have four million data points, but only three of which are unique. Furthermore, your timed solution gives the wrong result. – Alexander Sep 23 '19 at 21:16
  • are correct on my machine on that benchmark instance. If you have another instance to offer, write it in the comments that the text and add it to the comparison. Thanks – Massifox Sep 23 '19 at 21:22
  • @Alexander I updated my answer and added another benchmark instance with different features. I added the results. Let me know what you think. Thanks – Massifox Sep 23 '19 at 22:08
  • Better representation of the data, but your answer still gives the wrong result because it contains duplicates which still need to be removed. – Alexander Sep 23 '19 at 22:11
  • @Alexander, I updated the execution times of your solution, by mistake I still used the first instance instead of the second one. Why do you say the results are different? sorted (your_sol) == sorted (my_sol) gives True. – Massifox Sep 23 '19 at 22:25
  • Per your `timeit` function above, your result gives the wrong answer. `>>> list(chain.from_iterable(data)) # ['foo', 'bar', 'foo', 'two']` – Alexander Sep 23 '19 at 23:07
0

Hmm, this might be a bit more overhead than you're looking for but this should work and guarantees list-wise order unlike sets:

from itertools import cycle
from collections import Counter

output = []
checker = Counter()

for lst in cycle(data):
    if not data:
        break

    while lst:
        item = lst.pop(0)
        if not checker[item]:
            output.append(item)
            checker[item] += 1
            break            

    if not lst:
        data.remove(lst)
        continue  

output:

['foo', 'two', 'bar']
Brian
  • 1,572
  • 9
  • 18
-3

you can create a set and then convert to list again some like this:

l1 = ['foo', 'foo', 'bar', 'two']
l2 = list(set(l1))

it's create a second list without repeated items

if you want keept the order you can do this

ordered = dict.fromkeys(l1).keys()
juancarlos
  • 593
  • 3
  • 9