1

Suppose that I have the following list of lists containing lists:

samples = [
    # First sample
    [
        # Think 'x' as in input variable in ML
        [
            ['A','E'], # Data
            ['B','F']  # Metadata
        ],
        # Think 'y' as in target variable in ML
        [
            ['C','G'], # Data
            ['D','H'], # Metadata
        ]
    ],
    # Second sample
    [
        [
            ['1'],
            ['2']
        ],
        [
            ['3'],
            ['4']
        ]
    ]
]

The output that I'm after looks like the following:

>>> samples
[
    ['A','E','1'], # x.data
    ['B','F','2'], # x.metadata
    ['C','G','3'], # y.data
    ['D','H','4']  # y.metadata
]

My question: is there a way to use Python's zip function, perhaps together with some list comprehensions, to achieve this?

I have searched for some solutions, but for example this and this deal with using zip to address different lists, not inner lists.

A way to achieve this could very well be just a simple iteration over the samples like this:

x,x_len,y,y_len=[],[],[],[]

for sample in samples:
    x.append(sample[0][0])
    x_len.append(sample[0][1])
    y.append(sample[1][0])
    y_len.append(sample[1][1])

samples = [
    x,
    x_len,
    y,
    y_len
]

I'm still curious whether there is a way to use zip instead of explicitly looping over the samples and their nested lists.

Note that the data and metadata can vary in length across samples.
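Since the expected output above is flat, the loop shown earlier would need `extend` rather than `append`, because `append` adds each inner list as a single nested element. A minimal sketch of that variant, using the `samples` from the question (the name `merged` is only illustrative and not part of the original code):

```python
samples = [
    [[['A', 'E'], ['B', 'F']], [['C', 'G'], ['D', 'H']]],  # first sample
    [[['1'], ['2']], [['3'], ['4']]],                      # second sample
]

x, x_len, y, y_len = [], [], [], []

for sample in samples:
    # extend() adds the items of each inner list one by one, so inner
    # lists of different lengths still end up in one flat list
    x.extend(sample[0][0])
    x_len.extend(sample[0][1])
    y.extend(sample[1][0])
    y_len.extend(sample[1][1])

merged = [x, x_len, y, y_len]
print(merged)
```

This should print the four flat lists shown in the expected output, regardless of how long each inner list is.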

Georgy
  • Are the innermost lists always of length 1? – AKX Jul 25 '19 at 12:40
  • Can you modify what creates the samples in the first place? – Sayse Jul 25 '19 at 12:43
  • @AKX No, they are not. They are actually variable length tensors. That, however, might not be relevant here. I'm just interested in finding out if there exists a way to combine data from similarly structured list objects. – Petteri Nevavuori Jul 25 '19 at 12:43
  • @Sayse I'm unfortunately quite limited with my options due to using two separate frameworks with my data (`torch` and `skorch`). – Petteri Nevavuori Jul 25 '19 at 12:44

4 Answers

2

IIUC, one way is to use itertools.chain to flatten the results of zip(samples):

from itertools import chain

new_samples = [
    list(chain.from_iterable(y)) for y in zip(
        *((chain.from_iterable(*x)) for x in zip(samples))
    )
]

print(new_samples)
#[['A', 'E', '1'], ['B', 'F', '2'], ['C', 'G', '3'], ['D', 'H', '4']]

Step by step explanation

1) First call zip on samples:

print(list(zip(samples)))
#[([[['A', 'E'], ['B', 'F']], [['C', 'G'], ['D', 'H']]],),
# ([[['1'], ['2']], [['3'], ['4']]],)]

Notice that in the two lines in the output above, if the elements were flattened, you'd have the structure needed to zip in order to get your final results.

2) Use itertools.chain to flatten (which will be much more efficient than using sum).

print([list(chain.from_iterable(*x)) for x in zip(samples)])
#[[['A', 'E'], ['B', 'F'], ['C', 'G'], ['D', 'H']],
# [['1'], ['2'], ['3'], ['4']]]

3) Now call zip again:

print(list(zip(*((chain.from_iterable(*x)) for x in zip(samples)))))
#[(['A', 'E'], ['1']),
# (['B', 'F'], ['2']),
# (['C', 'G'], ['3']),
# (['D', 'H'], ['4'])]

4) Now you basically have what you want, except the lists are nested. So use itertools.chain again to flatten the final list.

print(
    [
        list(chain.from_iterable(y)) for y in zip(
            *((chain.from_iterable(*x)) for x in zip(samples))
        )
    ]
)
#[['A', 'E', '1'], ['B', 'F', '2'], ['C', 'G', '3'], ['D', 'H', '4']]
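As a side note, `zip(samples)` only wraps each sample in a 1-tuple, which `*x` then unpacks again, so iterating over `samples` directly should give the same result. A sketch of that simplification, split into two named steps (`flat_samples` is an illustrative name, not from the code above):

```python
from itertools import chain

samples = [
    [[['A', 'E'], ['B', 'F']], [['C', 'G'], ['D', 'H']]],
    [[['1'], ['2']], [['3'], ['4']]],
]

# Step 1: flatten each sample by one level, giving
# [x.data, x.metadata, y.data, y.metadata] per sample
flat_samples = [list(chain.from_iterable(sample)) for sample in samples]

# Step 2: zip groups the same field across all samples;
# chain.from_iterable then joins the pieces of each group
new_samples = [list(chain.from_iterable(parts)) for parts in zip(*flat_samples)]

print(new_samples)
```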
pault
0

You could do:

res = [[y for l in x for y in l] for x in zip(*([x for var in sample for x in var] for sample in samples))]

print(res)  # res is already a list of lists, so no conversion is needed

Gives on your example:

[['A', 'E', '1'], ['B', 'F', '2'], ['C', 'G', '3'], ['D', 'H', '4']]

This basically flattens each "sample" into a list and packs those into one big list, then unpacks that into zip, and finally flattens each zipped group into a single list.
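For readability, the same idea can be unrolled into plain loops. This is only a sketch of what the one-liner appears to do; the names `flattened`, `fields`, `group`, and `row` are illustrative:

```python
samples = [
    [[['A', 'E'], ['B', 'F']], [['C', 'G'], ['D', 'H']]],
    [[['1'], ['2']], [['3'], ['4']]],
]

# 1) Flatten each sample by one level
flattened = []
for sample in samples:
    fields = []
    for var in sample:        # var is the x part or the y part
        for field in var:     # field is a data or metadata list
            fields.append(field)
    flattened.append(fields)

# 2) zip(*flattened) groups the fields at the same position across samples
res = []
for group in zip(*flattened):
    row = []
    for part in group:        # part is one sample's list for this field
        row.extend(part)      # join the parts into one flat list
    res.append(row)

print(res)
```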

Tomerikoo
  • What if the data varies in length across samples? Try setting `samples[0][0][0] = ['A','E']` for example. This solution only returns the first element of the most inner lists. – Petteri Nevavuori Jul 25 '19 at 12:59
  • You didn't mention how it should handle such a situation. Should it be `[['A', 'E'], '1']` or `['A', 'E', '1']`? – Tomerikoo Jul 25 '19 at 13:00
  • You are correct and I apologize for that. For some reason I first thought that it shouldn't matter, but obviously it does. I have edited the question to reflect what I'm expecting as the output. – Petteri Nevavuori Jul 25 '19 at 13:03
  • Well, fixed it. Now it works for all lengths in any sample, it just got a little messy :) – Tomerikoo Jul 25 '19 at 13:16
0

Here's another solution. Quite ugly, but it does use zip, even twice!

>>> sum(map(lambda y: list(map(lambda x: sum(x, []), zip(*y))), zip(*samples)), [])
[['A', 'E', '1'], ['B', 'F', '2'], ['C', 'G', '3'], ['D', 'H', '4']]

It is interesting to see how it works, but please don't actually use it; it is both hard to read and algorithmically bad.
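If the two-zip structure is appealing but the repeated list concatenation in `sum(..., [])` is a concern, one option is to keep the shape and swap `sum` for `itertools.chain.from_iterable`. This is a sketch of that substitution, not the code from the answer above; `result`, `group`, and `parts` are illustrative names:

```python
from itertools import chain

samples = [
    [[['A', 'E'], ['B', 'F']], [['C', 'G'], ['D', 'H']]],
    [[['1'], ['2']], [['3'], ['4']]],
]

result = list(chain.from_iterable(
    # inner zip: pair up data with data and metadata with metadata,
    # then join each pair into one flat list
    (list(chain.from_iterable(parts)) for parts in zip(*group))
    # outer zip: group the x parts of all samples, then the y parts
    for group in zip(*samples)
))

print(result)
```

Here `chain.from_iterable` walks each element once instead of building a new intermediate list for every addition.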

L3viathan
  • This actually works with varying length samples! It is indeed really hard to read though. – Petteri Nevavuori Jul 25 '19 at 13:06
  • @PetteriNevavuori OP mentioned that you shouldn't do this, and the reason is that [using `sum(x, [])` to flatten a list takes quadratic time and is really inefficient](https://stackoverflow.com/a/49887692/5858851). – pault Jul 25 '19 at 13:56
  • Especially relevant since @PetteriNevavuori uses "variable-length tensors", in which case this might actually matter, as they could be quite long in an ML context. – L3viathan Jul 25 '19 at 14:36
  • Now that @pault changed his response a bit, his response is most clear to me. Thanks for giving your solution though! And efficiency does indeed matter in my context. – Petteri Nevavuori Jul 26 '19 at 05:09
0

That is not the most comfortable data structure to work with. I would advise refactoring the code and choosing something other than triply nested lists to hold the data, but if that is currently not possible, I suggest the following approach:

import itertools


def flatten(iterable):
    yield from itertools.chain.from_iterable(iterable)


result = []
for elements in zip(*map(flatten, samples)):
    result.append(list(flatten(elements)))

For your example it gives:

[['A', 'E', '1'], 
 ['B', 'F', '2'], 
 ['C', 'G', '3'], 
 ['D', 'H', '4']]

Test for more than 2 samples:

samples = [[[['A', 'E'], ['B', 'F']],
            [['C', 'G'], ['D', 'H']]],
           [[['1'], ['2']], 
            [['3'], ['4']]], 
           [[['5'], ['6']],
            [['7'], ['8']]]]

gives:

[['A', 'E', '1', '5'],
 ['B', 'F', '2', '6'],
 ['C', 'G', '3', '7'],
 ['D', 'H', '4', '8']]

Explanation:

The flatten generator function simply flattens one level of a nested iterable. It is based on the itertools.chain.from_iterable function. In map(flatten, samples) we apply this function to each element of samples:

>>> map(flatten, samples)
<map at 0x3c6685fef0>  # <-- map object returned, to see result wrap it in `list`:

>>> list(map(flatten, samples))
[<generator object flatten at 0x0000003C67A2F9A8>,  # <-- will flatten the 1st sample
 <generator object flatten at 0x0000003C67A2FA98>,  # <-- ... the 2nd
 <generator object flatten at 0x0000003C67A2FB10>]  # <-- ... the 3rd and so on if there are more

# We can see what each generator will give by applying `list` on each one of them
>>> list(map(list, map(flatten, samples)))
[[['A', 'E'], ['B', 'F'], ['C', 'G'], ['D', 'H']],
 [['1'], ['2'], ['3'], ['4']],
 [['5'], ['6'], ['7'], ['8']]]

Next, we can use zip to iterate over the flattened samples. Note that we cannot apply it to the map object directly:

>>> list(zip(map(flatten, samples)))
[(<generator object flatten at 0x0000003C66944138>,),
 (<generator object flatten at 0x0000003C669441B0>,),
 (<generator object flatten at 0x0000003C66944228>,)]

We should unpack it first:

>>> list(zip(*map(flatten, samples)))
[(['A', 'E'], ['1'], ['5']),
 (['B', 'F'], ['2'], ['6']),
 (['C', 'G'], ['3'], ['7']),
 (['D', 'H'], ['4'], ['8'])]

# or in a for loop:
>>> for elements in zip(*map(flatten, samples)):
...     print(elements)
(['A', 'E'], ['1'], ['5'])
(['B', 'F'], ['2'], ['6'])
(['C', 'G'], ['3'], ['7'])
(['D', 'H'], ['4'], ['8'])

Finally, we just have to join all the lists in each elements tuple together. We can use the same flatten function for that:

>>> for elements in zip(*map(flatten, samples)):
...     print(list(flatten(elements)))
['A', 'E', '1', '5']
['B', 'F', '2', '6']
['C', 'G', '3', '7']
['D', 'H', '4', '8']

And you just have to put it all back in a list as shown in the first code sample.
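One thing to keep in mind is that zip stops at the shortest input, so a sample with fewer fields than the others would silently shorten the result. A possible way to guard against that is to wrap the steps in a function with an explicit check. This is only a sketch: `merge_samples` is a hypothetical helper name, and the check assumes every sample is expected to have the same number of fields:

```python
from itertools import chain


def merge_samples(samples):
    """Merge same-position fields across samples into flat lists.

    Hypothetical helper: assumes every sample flattens to the same
    number of fields (for example x.data, x.metadata, y.data, y.metadata).
    """
    # Flatten each sample by one level
    flat = [list(chain.from_iterable(sample)) for sample in samples]

    # zip would silently truncate to the shortest sample, so check first
    if len({len(fields) for fields in flat}) > 1:
        raise ValueError("samples have different numbers of fields")

    # Group fields by position and join each group into one flat list
    return [list(chain.from_iterable(group)) for group in zip(*flat)]


samples = [[[['A', 'E'], ['B', 'F']],
            [['C', 'G'], ['D', 'H']]],
           [[['1'], ['2']],
            [['3'], ['4']]],
           [[['5'], ['6']],
            [['7'], ['8']]]]

print(merge_samples(samples))
```

The inner lists themselves may still vary in length across samples; only the number of fields per sample is checked.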

Georgy