
Preface

I have a test where I'm working with nested iterables (by a nested iterable I mean an iterable with only iterables as elements).

As a test skeleton, consider

from itertools import tee
from typing import (Any,
                    Iterable)


def foo(nested_iterable: Iterable[Iterable[Any]]) -> Any:
    ...


def test_foo(nested_iterable: Iterable[Iterable[Any]]) -> None:
    original, target = tee(nested_iterable)  # this doesn't copy the iterators' elements

    result = foo(target)

    assert is_contract_satisfied(result, original)


def is_contract_satisfied(result: Any,
                          original: Iterable[Iterable[Any]]) -> bool:
    ...

E.g. foo may be a simple identity function

def foo(nested_iterable: Iterable[Iterable[Any]]) -> Iterable[Iterable[Any]]:
    return nested_iterable

and the contract simply checks that the flattened iterables have the same elements

from itertools import (chain,
                       starmap,
                       zip_longest)
from operator import eq
...
flatten = chain.from_iterable


def is_contract_satisfied(result: Iterable[Iterable[Any]],
                          original: Iterable[Iterable[Any]]) -> bool:
    return all(starmap(eq,
                       zip_longest(flatten(result), flatten(original),
                                   # we're assuming that ``object()``
                                   # will create some unique object
                                   # not presented in any of arguments
                                   fillvalue=object())))
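
For plain sequences this contract behaves as expected; a quick hypothetical check with made-up values, assuming the definitions above:

>>> is_contract_satisfied([[0, 1], [2]], [[0, 1], [2]])
True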

But if some of the nested_iterable elements are iterators, they may become exhausted, since tee makes shallow copies, not deep ones, i.e. for the given foo and is_contract_satisfied the next statement

>>> test_foo([iter(range(10))])

leads to a predictable

Traceback (most recent call last):
  ...
    test_foo([iter(range(10))])
  File "...", line 19, in test_foo
    assert is_contract_satisfied(result, original)
AssertionError

Problem

How to deep copy an arbitrary nested iterable?

Note

I'm aware of the copy.deepcopy function, but it won't work for file objects.
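
For instance, in Python 3 deep-copying an open file object fails along these lines (a sketch with a hypothetical existing file; the exact exception message may vary between versions):

>>> import copy
>>> file_object = open('example.txt')  # hypothetical existing file
>>> copy.deepcopy(file_object)
Traceback (most recent call last):
  ...
TypeError: cannot serialize '_io.TextIOWrapper' object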

Azat Ibrakov
  • Is there any reason you are averse to simply materializing the nested iterator into a nested list, let's say? – juanpa.arrivillaga Dec 14 '18 at 22:03
  • @juanpa.arrivillaga: yes, I'm writing a library which works with arbitrary iterables (both finite & infinite, user-defined or from the standard library), and writing property-based tests – Azat Ibrakov Dec 14 '18 at 22:18

2 Answers


Naive solution

A straightforward algorithm would be

  1. Perform elementwise copying of the original nested iterable.
  2. Make n copies of the elementwise copy.
  3. For each independent copy, pick the inner copy at the corresponding index.

which may be implemented like

from itertools import tee
from operator import itemgetter
from typing import (Any,
                    Iterable,
                    Tuple,
                    TypeVar)

Domain = TypeVar('Domain')


def copy_nested_iterable(nested_iterable: Iterable[Iterable[Domain]],
                         *,
                         count: int = 2
                         ) -> Tuple[Iterable[Iterable[Domain]], ...]:
    def shallow_copy(iterable: Iterable[Domain]) -> Tuple[Iterable[Domain], ...]:
        return tee(iterable, count)

    copies = shallow_copy(map(shallow_copy, nested_iterable))
    return tuple(map(itemgetter(index), iterables)
                 for index, iterables in enumerate(copies))
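
A quick sanity check that the copies are independent (a hypothetical session with made-up values, assuming the definition above):

>>> first, second = copy_nested_iterable([iter(range(3)), iter('abc')])
>>> [list(sub_iterable) for sub_iterable in first]
[[0, 1, 2], ['a', 'b', 'c']]
>>> [list(sub_iterable) for sub_iterable in second]
[[0, 1, 2], ['a', 'b', 'c']]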

Pros:

  • quite easy to read & explain.

Cons:

  • if we wanted to extend our approach to iterables with a greater nesting level (like an iterable of nested iterables and so on), this approach doesn't look helpful.

We can do better.

Improved solution

If we look at the itertools.tee function documentation, it contains a Python recipe which, with the help of the functools.singledispatch decorator, can be rewritten like

from collections import (abc,
                         deque)
from functools import singledispatch
from itertools import repeat
from typing import (Iterable,
                    Tuple,
                    TypeVar)

Domain = TypeVar('Domain')


@singledispatch
def copy(object_: Domain,
         *,
         count: int) -> Iterable[Domain]:
    raise TypeError('Unsupported object type: {type}.'
                    .format(type=type(object_)))

# handle general case
@copy.register(object)
# immutable strings represent a special kind of iterables
# that can be copied by simply repeating
@copy.register(bytes)
@copy.register(str)
# mappings cannot be copied as other iterables
# since they are iterable only by key
@copy.register(abc.Mapping)
def copy_object(object_: Domain,
                *,
                count: int) -> Iterable[Domain]:
    return repeat(object_, count)


@copy.register(abc.Iterable)
def copy_iterable(object_: Iterable[Domain],
                  *,
                  count: int = 2) -> Tuple[Iterable[Domain], ...]:
    iterator = iter(object_)
    # we are using `itertools.repeat` instead of `range` here
    # due to efficiency of the former
    # more info at
    # https://stackoverflow.com/questions/9059173/what-is-the-purpose-in-pythons-itertools-repeat/9098860#9098860
    queues = [deque() for _ in repeat(None, count)]

    def replica(queue: deque) -> Iterable[Domain]:
        while True:
            if not queue:
                try:
                    element = next(iterator)
                except StopIteration:
                    return
                element_copies = copy(element,
                                      count=count)
                for sub_queue, element_copy in zip(queues, element_copies):
                    sub_queue.append(element_copy)
            yield queue.popleft()

    return tuple(replica(queue) for queue in queues)
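
A hypothetical session showing how it copes with mixed and nested elements (the input values are made up, the small `materialize` helper is introduced here only for display and is not part of the recipe, and the imports from the snippet above are assumed to be in scope):

>>> def materialize(object_):
...     # recursively consume nested copies into lists for display,
...     # leaving strings, bytes and non-iterables as they are
...     return ([materialize(element) for element in object_]
...             if (isinstance(object_, abc.Iterable)
...                 and not isinstance(object_, (str, bytes)))
...             else object_)
...
>>> first, second = copy_iterable([iter(range(3)), [iter('ab'), 'cd'], 42])
>>> materialize(first)
[[0, 1, 2], [['a', 'b'], 'cd'], 42]
>>> materialize(second)
[[0, 1, 2], [['a', 'b'], 'cd'], 42]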

Pros:

  • handles nesting on deeper levels and even mixed elements, i.e. both iterables and non-iterables on the same level,
  • may be extended for user-defined structures (e.g. for making independent deep copies of them).

Cons:

  • less readable (but as we know "practicality beats purity"),
  • introduces some overhead related to dispatching (but that's fine since it is based on a dictionary lookup, which has O(1) complexity).

Test

Preparation

Let's define our nested iterable as follows

nested_iterable = [range(10 ** index) for index in range(1, 7)]

Since iterator creation says nothing about the performance of the underlying copies, let's define a function for exhausting iterators (described here)

exhaust_iterable = deque(maxlen=0).extend
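
As a throwaway illustration of the trick: a deque with maxlen=0 discards everything passed to it, so its bound extend method consumes an iterator without storing any elements:

>>> from collections import deque
>>> exhaust_iterable = deque(maxlen=0).extend
>>> iterator = iter(range(3))
>>> exhaust_iterable(iterator)
>>> next(iterator, 'exhausted')
'exhausted'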

Time

Using the timeit module

import timeit

def naive(): exhaust_iterable(copy_nested_iterable(nested_iterable))

def improved(): exhaust_iterable(copy_iterable(nested_iterable))

print('naive approach:', min(timeit.repeat(naive)))
print('improved approach:', min(timeit.repeat(improved)))

On my laptop with Windows 10 x64 and Python 3.5.4 I get

naive approach: 5.1863865
improved approach: 3.5602296000000013

Memory

Using the memory_profiler package
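
The profiled function looked roughly like the sketch below (reconstructed from the reports that follow; the @profile decorator is memory_profiler's line-by-line marker, and the script is assumed to be run e.g. via `python -m memory_profiler script.py`):

# a sketch of the profiling harness, reconstructed from the reports below
@profile
def profile_memory(nested_iterable: Iterable[Iterable[Any]]) -> None:
    result = list(flatten(flatten(copy_nested_iterable(nested_iterable))))
    # for the "improved" measurement ``copy_nested_iterable``
    # is replaced with ``copy_iterable``


profile_memory(nested_iterable)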

Line #    Mem usage    Increment   Line Contents
================================================
    78     17.2 MiB     17.2 MiB   @profile
    79                             def profile_memory(nested_iterable: Iterable[Iterable[Any]]) -> None:
    80     68.6 MiB     51.4 MiB       result = list(flatten(flatten(copy_nested_iterable(nested_iterable))))

for "naive" approach and

Line #    Mem usage    Increment   Line Contents
================================================
    78     17.2 MiB     17.2 MiB   @profile
    79                             def profile_memory(nested_iterable: Iterable[Iterable[Any]]) -> None:
    80     68.7 MiB     51.4 MiB       result = list(flatten(flatten(copy_iterable(nested_iterable))))

for "improved" one.

Note: I've made separate runs of the script for each approach, because running them at once wouldn't be representative, since the second statement would reuse previously created under-the-hood int objects.


Conclusion

As we can see, both functions have similar performance, but the latter one supports deeper levels of nesting and looks pretty extensible.

Advertisement

I've added "improved" solution to lz package from 0.4.0 version which can be used like

>>> from lz.replication import replicate
>>> iterable = iter(range(5))
>>> list(map(list, replicate(iterable,
                             count=3)))
[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]

It is property-based tested using the hypothesis framework, so we may be sure that it works as expected.

Azat Ibrakov

Addressing your question: How to deep copy a nested iterable?

You can use deepcopy from the standard library:

>>> from copy import deepcopy
>>> 
>>> ni = [1, [2,3,4]]
>>> ci = deepcopy(ni)
>>> ci[1][0] = "Modified"
>>> ci
[1, ['Modified', 3, 4]]
>>> ni
[1, [2,3,4]]

Update

@Azat Ibrakov said: you are working with sequences, try to deepcopy a file object for example (hint: it will fail)

No, deepcopy on a file object won't fail; you can deep copy a file object. Demonstration:

import copy

with open('example.txt', 'w') as f:
    f.writelines(["{}\n".format(i) for i in range(100)])

with open('example.txt', 'r') as f:
    l = [1, [f]]
    c = copy.deepcopy(l)
    print(isinstance(c[1][0], file))  # Prints  True.
    print("\n".join(dir(c[1][0])))

Prints:

True
__class__
__delattr__
__doc__
__enter__
__exit__
__format__
__getattribute__
...
write
writelines
xreadlines

The problem is in the concept.

According to the Python iterator protocol, the items contained in some container are obtained by executing the next function (see the docs here).

You won't have all the items of an object that implements the iterator protocol (such as file objects) until you traverse the whole iterator (execute next() until a StopIteration exception is raised).

That's because there is no way to tell for sure the result of executing the next (__next__ in Python 3.x) method of an iterator.

See the following example:

import random

class RandomNumberIterator:

    def __init__(self):
        self.count = 0
        self.internal_it = range(10)  # For later demonstration of deepcopy

    def __iter__(self):
        return self

    def __next__(self):
        self.count += 1
        if self.count == 10:
            raise StopIteration
        return random.randint(0, 1000)

ri = RandomNumberIterator()

for i in ri:
    print(i)  # This will print random numbers each time.
              # Can you come up with some sort of mechanism to be able
              # to copy **THE CONTENT** of the `ri` iterator?

Again you could:

from copy import deepcopy

cri = deepcopy(ri)

for i in cri.internal_it:
    print(i)   # Will print numbers 0..9
               # Deepcopy on ri successful!

A file object is a special case here since there are file handles involved: as you saw before, you can deepcopy a file object, but the copy will be in a closed state.

Alternative.

You could call list on your iterables; that will automatically evaluate the iterables and then you will be able to test THE ITERABLE'S CONTENT again.

Returning to files:

with open('example.txt', 'w') as f:
    f.writelines(["{}\n".format(i) for i in range(5)])

with open('example.txt', 'r') as f:
    print(list(f))  # Prints ['0\n', '1\n', '2\n', '3\n', '4\n']

So, to sum up

You can deepcopy nested iterables, but you can't evaluate iterables while they are being copied; it just makes no sense (remember RandomNumberIterator).

If you need to test the iterables' CONTENT, you need to evaluate them.

Raydel Miranda
  • you are working with sequences, try to deepcopy a file object for example (hint: it will fail) – Azat Ibrakov Dec 14 '18 at 20:39
  • which Python version are you using? for Python 3 `deepcopy`ing file object will end up with `TypeError: cannot serialize '_io.TextIOWrapper' object` – Azat Ibrakov Dec 15 '18 at 09:31
  • what do you mean by "you can't evaluate iterables while they are being copied", I can successfully make copies of plain iterables with `itertools.tee` and then evaluate each one of them independently, even potentially infinite ones – Azat Ibrakov Dec 15 '18 at 09:35
  • in Python 2.7 if I do `copy.deepcopy(file)` I get `<closed file '<uninitialized file>', mode '<uninitialized file>' at 0x7f4a99ac3930>` and after trying to iterate over it like `list(file_copy)` it raises `ValueError: I/O operation on closed file`, while the original one works as expected, so no, you can't make a functioning copy of a file object with the `copy.deepcopy` function – Azat Ibrakov Dec 15 '18 at 09:39
  • @Azat Ibrakov, in your code samples you're using type annotations, so I assume you're using Python 3.x, and I've used Python 3.x as well for my answer. On the other hand, the file object is not the same as the file content. – Raydel Miranda Dec 15 '18 at 18:02
  • I know what a file object is and what file contents are, I don't know what Python you are using, but your code doesn't work in Python 3.3+ and Python 2.7 for both Windows and Linux, so I don't know what you are trying to sell me here, also there may be user-defined iterable structures that have no support for `copy.deepcopy` – Azat Ibrakov Dec 15 '18 at 19:23