1

Please note this is not a duplicate of this post because I want to zip more than 2 lists (or at least I cannot easily generalize that post for use here without explicit loops)

I want to find the best performing (in terms of speed) implementation that merges list of lists in a particular way. The input is a list of lists (or tuples), ordered such that the length of the next list is always multiples of the previous one. For example:

a = ['A', 'B', 'C', 'D']
b = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
input_list = [a, b]

The output is a merged list of:

output = ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'A', 'E', 'B', 'F', 'C', 'G', 'D', 'H']

That is, the shorter lists (in this case a) all get expanded to the longest list (in this case b) by cycling over itself so that lists will have equal length. Then all lists are merged in a vertical stacking fashion.

Currently I have an implementation that essentially does the following:

step 1            step 2           step 3 
======            ========         ======
ABCD              ABCDABCD
ABCDEFGH -------> ABCDEFGH ------> AABBCCDDAEBFCGDH

It works but not efficient:

def flatten_list(value):
    return sum(value, [])

def group(value):
    for idx in reversed(range(1, len(value))):
        multiplier = int(len(value[idx]) / len(value[idx - 1]))
        if multiplier > 1:
            value[idx - 1] = flatten_list([value[idx - 1] for i in range(multiplier)])
    return flatten_list(list(zip(*value)))

Is there a faster implementation? Performance is really crucial to my application as the input can be huge. Any suggestions are appreciated!

gaow
  • 31
  • 5
  • 1
    To flatten list/iterator of tuples: https://stackoverflow.com/q/952914/2301450 `sum(..., [])` has quadratic complexity – vaultah Jan 19 '18 at 17:11
  • Thanks @vaultah but how about having more than 2 lists to zip? – gaow Jan 19 '18 at 17:26

4 Answers4

4

Using cycle, chain, islice from itertools:

list(chain(*islice(
    zip(*(cycle(l) for l in input_list)), 
    0, len(max(input_list, key=len)))))
# ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'A', 'E', 'B', 'F', 'C', 'G', 'D', 'H']

Or, in its parts:

# generator producing cycles of each list
(cycle(l) for l in input_list)  
# zip these cycles together: ('A', 'A') -> ('B', 'B') -> infinitely
zip(*...)  
# take a slice of this iterable of tuples with the length of the longest list
islice(..., 0, len(max(input_list, key=len)))
# chain those tuples together into one list
list(chain(*...))   

Or illustrated:

lists = [
                 #        chain--┐-----┐-----┐
                 #              ┌--┐  ┌--┐  ┌--┐     
                 #        | ┌-┐ | ┌-┐ | ┌-┐ | ┌-┐  | ┌-┐     
   [1],          # cycle: | |1|,| |1|,| |1|,| |1|, | |1|, ...
   [1, 2],       # cycle: | |1|,| |2|,| |1|,| |2|, | |1|, ...
   [1, 2, 3, 4], # cycle: | |1|,| |2|,| |3|,| |4|, | |1|, ...
]                #        | └-┘ | └-┘ | └-┘ | └-┘  | └-┘  
                 #        |  └--┘  └--┘  └--┘      |
                 #        | zip   zip   zip   zip  | zip  ...
                 #        |                        |    
                 #   islice start           islice stop
                 #   -->  [1,1,1,1,2,2,1,1,3,1,2,4]

The time complexity of this is O(n) where n is the length of the output list. In Python2 you will have to use itertools.izip instead of zip as the latter would try to build an infinite list.

user2390182
  • 72,016
  • 6
  • 67
  • 89
  • 1
    unless there's some sort of optimization under the hood, it should be (infinitesimally) faster to find max length with `max(map(len, input_list))` Since that operates on each list only once, while yours operates on each list once, then one list one more time to grab its length. Nice solution! I never use `islice`. – Adam Smith Jan 19 '18 at 17:45
  • 1
    @AdamSmith It should not make a big difference since the length of a list is retrieved in O(1). I timed both approaches and mine was even a little faster (insignificant): `stmt1 = 'len(max(lists, key=len))'; stmt2 = 'max(map(len, lists))'; timeit(stmt1, setup); 5.21139558899813; timeit(stmt2, setup); 5.373297284000728;` Btw, Mine calls `len` on each list only once, and then once more only for the longest one. – user2390182 Jan 19 '18 at 17:53
  • No, it'll barely make a *small* difference! But yes that's what I said above "operates on each list once, then one list [once more]" – Adam Smith Jan 19 '18 at 18:13
  • Thank you @schwobaseggl and Adam for the code, illustrations and discussions. I've adopted this solution in my program. Made a huge improvement! – gaow Jan 19 '18 at 18:50
2

Using the roundrobin itertools recipe:

Two Inputs

import itertools as it


a = list("ABCD")
b = list("ABCDEFGH")

list(it.chain.from_iterable(roundrobin(zip(it.cycle(a), b))))
# ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'A', 'E', 'B', 'F', 'C', 'G', 'D', 'H']

itertools.cycle() infinitely extends the shorter iterable. zip() stops iterating after the shorter iterable. roundrobin handles the interleaving of elements between iterables.


Longer Inputs

To work on more than two inputs, we need to cycle all but the last iterable:

def interleave(*inputs):
    *rest, last = inputs
    cycles = (it.cycle(x) for x in rest)
    return list(it.chain.from_iterable(mit.roundrobin(zip(*cycles, last))))

Now for two or more input iterables, we can apply the interleave function:

p = list("ab")
q = list("abcd")
r = list("abcdef")

input_list_1 = [a, b]
input_list_2 = [p, q, r]


print(interleave(*input_list_1))
# ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'A', 'E', 'B', 'F', 'C', 'G', 'D', 'H']
print(interleave(*input_list_2))
# ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'c', 'c', 'b', 'd', 'd', 'a', 'a', 'e', 'b', 'b', 'f']

Note: you can either reimplement the roundrobin recipe from the docs or install a third-party library that implements it for you, e.g. more_itertools. > pip install more_itertools, then in Python, from more_itertools import roundrobin.

pylang
  • 40,867
  • 14
  • 129
  • 121
  • Fixed. Thanks for the comment. – pylang Jan 19 '18 at 17:55
  • 1
    @vaultah: That's fine, because we're given that the length of each list is a multiple of the length of the previous. The bigger issue is that this code is specific to two lists. – user2357112 Jan 19 '18 at 17:58
  • @user2357112 This code works for iterables. It is not constrained to lists alone. @vaultah This assumes the shorter iterable is `a`. – pylang Jan 19 '18 at 18:13
  • @user2357112 the limits to two iterables has been resolved with a second option. – pylang Jan 19 '18 at 18:48
1

apply itertools.cycle to all lists shorter than the longest one and zip them all together, flattening the result.

def special_zip(*lsts):
    max_len = max(map(len, lsts))
    cyclic = [lst if len(lst) == max_len else itertools.cycle(lst) for lst in lsts]
    return (el for grp in zip(*cyclic) for el in grp)

off-hand this should be O(2n) where n is the sum of the lengths of the lists. If you know the longest list and can pass it separately, this becomes O(n). Similarly if you know how much output you need (since you can just apply itertools.cycle to each list and pull from the infinite output)

Adam Smith
  • 52,157
  • 12
  • 73
  • 112
  • any list with max_len length need not be cycled, but it won't hurt if it is as long as one is bounded. However you're right about the flattening: let me do that. – Adam Smith Jan 19 '18 at 17:40
1

will this work?

final_result=[]
def recursive_approach(aa,bb):
    if not bb:
        return 0
    else:
        for i in zip(aa,bb):
            final_result.extend([i[0],i[1]])
    return recursive_approach(aa,bb[len(aa):])
recursive_approach(a,b)

print(final_result)

output:

['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'A', 'E', 'B', 'F', 'C', 'G', 'D', 'H']