
Given a list of arbitrarily deep nested lists of arbitrary size, I would like a flat, depth-first iterator over all elements in the tree, but with path indices as well such that:

for x, y in flatten(L), x == L[y[0]][y[1]]...[y[-1]]. 

That is,

L = [[[1, 2, 3], [4, 5]], [6], [7,[8,9]], 10]
flatten(L)

should yield:

(1, (0, 0, 0)),
(2, (0, 0, 1)),
(3, (0, 0, 2)),
(4, (0, 1, 0)),
(5, (0, 1, 1)),
(6, (1, 0)),
(7, (2, 0)),
(8, (2, 1, 0)),
(9, (2, 1, 1)),
(10, (3,))
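
A minimal sketch of how that invariant can be checked mechanically, using functools.reduce and operator.getitem (the helper name check is not part of the original post):

from functools import reduce
from operator import getitem

def check(l, pairs):
    # follow each index path back into the structure and compare
    return all(x == reduce(getitem, path, l) for x, path in pairs)

# check(L, flatten(L)) should be True for a correct flatten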

I made a recursive implementation for this using generators with yield statements:

def flatten(l):
    for i, e in enumerate(l):
        try:
            for x, y in flatten(e):
                yield x, (i,) + y
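        # enumerate(e) raises TypeError when e is not iterable (e.g. an int);
        # the bare except catches that and treats e as a leaf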
        except:
            yield e, (i,)

but I don't think it's a good or responsible implementation. Is there a recipe for doing this more generally, using just builtins or standard library tools like itertools?

funnydman
RBF06

3 Answers


I think your own solution is alright, that there's nothing really simpler, and that Python's standard library stuff wouldn't help. But here's another way anyway, which works iteratively instead of recursively so it can handle very deeply nested lists.

def flatten(l):
    stack = [enumerate(l)]
    path = [None]
    while stack:
        for path[-1], x in stack[-1]:
            if isinstance(x, list):
                stack.append(enumerate(x))
                path.append(None)
            else:
                yield x, tuple(path)
            break
        else:
            stack.pop()
            path.pop()

I keep the currently "active" lists on a stack of enumerate iterators, and the current index path as another stack. Then in a while-loop I always try to take the next element from the current list and handle it appropriately:

  • If that next element is a list, then I push its enumerate iterator on the stack and make room for the deeper index on the index path stack.
  • If that next element is a number, then I yield it along with its path.
  • If there was no next element in the current list, then I remove it (or rather its iterator) and its index spot from the stacks.

Demo:

>>> L = [[[1, 2, 3], [4, 5]], [6], [7,[8,9]], 10]
>>> for entry in flatten(L):
        print(entry)

(1, (0, 0, 0))
(2, (0, 0, 1))
(3, (0, 0, 2))
(4, (0, 1, 0))
(5, (0, 1, 1))
(6, (1, 0))
(7, (2, 0))
(8, (2, 1, 0))
(9, (2, 1, 1))
(10, (3,))

Note that if you process the entries on the fly, like printing does, then you could just yield the path as the list it is, i.e., use yield x, path. Demo:

>>> for entry in flatten(L):
        print(entry)

(1, [0, 0, 0])
(2, [0, 0, 1])
(3, [0, 0, 2])
(4, [0, 1, 0])
(5, [0, 1, 1])
(6, [1, 0])
(7, [2, 0])
(8, [2, 1, 0])
(9, [2, 1, 1])
(10, [3])

This way, the iterator only takes O(n) time for the whole iteration, where n is the total number of objects in the structure (both lists and numbers). Of course the printing increases the complexity, just like creating the tuples does. But that's outside of the generator and the "fault" of the printing or whatever you're doing with each path. If you for example only look at each path's length instead of its contents, which takes O(1), then the whole thing really is O(n).
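
For example, with the yield x, path variant, computing the maximum nesting depth only ever looks at each path's length, so the whole pass stays O(n):

>>> max(len(path) for _, path in flatten(L))
3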

All that said, again, I think your own solution is alright. And clearly simpler than this. And like I commented under @naomik's answer, I think your solution not being able to handle lists of depth around 1000 or more is rather irrelevant. One should not even have such a list in the first place. If one does, that's a mistake that should be fixed instead. If the list can also go wide, as in your case, and is balanced, then even with a branch factor of just 2 you'd run out of memory at a depth well under 100 and you wouldn't get anywhere near 1000. If the list can't go wide, then nested lists is the wrong data structure choice, plus you wouldn't be interested in the index path in the first place. If it can go wide but doesn't, then I'd say the creation algorithm should be improved (for example if it represents a sorted tree, add balancing).

About my solution again: Besides its ability to handle arbitrarily deep lists and its efficiency, I find some of its details interesting to note:

  • You rarely ever see enumerate objects being stored somewhere. Usually they're just used directly in loops and the like, as in for i, x in enumerate(l):.
  • Having the path[-1] spot ready and writing into it with for path[-1], x in ....
  • Using a for-loop with an immediate break and an else branch, to fetch just the next single value and handle exhausted iterators gracefully, without try/except and without next with some default (a sketch of that next-based alternative follows this list).
  • If you do yield x, path, i.e., don't turn each path into a tuple, then you really need to process it directly during the iteration. For example if you do list(flatten(L)), then you get [(1, []), (2, []), (3, []), (4, []), (5, []), (6, []), (7, []), (8, []), (9, []), (10, [])]. That is, "all" index paths will be empty. Of course that's because there really only is one path object, which I update and yield over and over again, and by the end it's empty. This is very similar to itertools.groupby, where for example [list(g) for _, g in list(groupby('aaabbbb'))] gives you [[], ['b']]. And it's not a bad thing. I recently wrote about that extensively.
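
For comparison, here is a minimal sketch of the next-with-a-default alternative that the for/break/else trick avoids (the sentinel name _end is mine):

def flatten_with_next(l):
    stack = [enumerate(l)]
    path = [None]
    _end = object()  # unique sentinel marking an exhausted iterator
    while stack:
        item = next(stack[-1], _end)
        if item is _end:
            stack.pop()
            path.pop()
        else:
            path[-1], x = item
            if isinstance(x, list):
                stack.append(enumerate(x))
                path.append(None)
            else:
                yield x, tuple(path)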

Shorter version with one stack holding indexes and enumerate objects, alternating:

def flatten(l):
    stack = [None, enumerate(l)]
    while stack:
        for stack[-2], x in stack[-1]:
            if isinstance(x, list):
                stack += None, enumerate(x)
            else:
                yield x, stack[::2]
            break
        else:
            del stack[-2:]
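
Note that stack[::2] slices out a fresh list of indexes on every yield, so unlike yield x, path above, these paths survive being collected with list() (they are lists rather than tuples, though):

>>> list(flatten(L))[:3]
[(1, [0, 0, 0]), (2, [0, 0, 1]), (3, [0, 0, 2])]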
Stefan Pochmann

Starting with direct recursion and state variables with default values,

def flatten (l, i = 0, path = (), acc = []):
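  # the default acc=[] is evaluated only once, but it's never mutated below
  # (acc + [...] always builds a new list), so the shared default is harmless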
  if not l:
    return acc
  else:
    first, *rest = l
    if isinstance (first, list):
      return flatten (first, 0, path + (i,), acc) + flatten (rest, i + 1, path, [])
    else:
      return flatten (rest, i + 1, path, acc + [ (first, path + (i,)) ])

print (flatten (L))
# [ (1, (0, 0, 0))
# , (2, (0, 0, 1))
# , (3, (0, 0, 2))
# , (4, (0, 1, 0))
# , (5, (0, 1, 1))
# , (6, (1, 0))
# , (7, (2, 0))
# , (8, (2, 1, 0))
# , (9, (2, 1, 1))
# , (10, (3,))
# ]

The program above shares the same weakness as yours; it is not safe for deep lists. We can use continuation-passing style to make it tail recursive:

def identity (x):
  return x

# tail-recursive, but still not stack-safe, yet
def flatten (l, i = 0, path = (), acc = [], cont = identity):
  if not l:
    return cont (acc)
  else:
    first, *rest = l
    if isinstance (first, list):
      return flatten (first, 0, path + (i,), acc, lambda left:
        flatten (rest, i + 1, path, [], lambda right:
          cont (left + right)))
    else:
      return flatten (rest, i + 1, path, acc + [ (first, path + (i,)) ], cont)


print (flatten (L))
# [ (1, (0, 0, 0))
# , (2, (0, 0, 1))
# , (3, (0, 0, 2))
# , (4, (0, 1, 0))
# , (5, (0, 1, 1))
# , (6, (1, 0))
# , (7, (2, 0))
# , (8, (2, 1, 0))
# , (9, (2, 1, 1))
# , (10, (3,))
# ]

Finally, we replace the recursive calls with our own call mechanism. This effectively sequences the recursive calls, and now it works for data of any size and any level of nesting. This technique is called a trampoline:

def identity (x):
  return x

def flatten (l):
  def loop (l, i = 0, path = (), acc = [], cont = identity):  
    if not l:
      return cont (acc)
    else:
      first, *rest = l
      if isinstance (first, list):
        return call (loop, first, 0, path + (i,), acc, lambda left:
          call (loop, rest, i + 1, path, [], lambda right:
            cont (left + right)))
      else:
        return call (loop, rest, i + 1, path, acc + [ (first, path + (i,)) ], cont)

  return loop (l) .run ()

class call:
  def __init__ (self, f, *xs):
    self.f = f
    self.xs = xs

  def run (self):
    acc = self
    while (isinstance (acc, call)):
      acc = acc.f (*acc.xs)
    return acc

print (flatten (L))
# [ (1, (0, 0, 0))
# , (2, (0, 0, 1))
# , (3, (0, 0, 2))
# , (4, (0, 1, 0))
# , (5, (0, 1, 1))
# , (6, (1, 0))
# , (7, (2, 0))
# , (8, (2, 1, 0))
# , (9, (2, 1, 1))
# , (10, (3,))
# ]
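
To see the trampoline in isolation, here is a minimal sketch (countdown is a toy function of mine, not part of the answer) that bounces through the same call/run machinery without growing the Python stack:

def countdown(n, cont=identity):
    if n == 0:
        return cont("done")
    # return a description of the next call instead of recursing directly
    return call(countdown, n - 1, cont)

print(countdown(100000).run())  # "done", with no RecursionError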

Why is it better? Objectively speaking, it's a more complete program. Just because it appears more complex doesn't mean it is less efficient.

The code provided in the question fails when the input list is nested more than 996 levels deep (in Python 3.x):

depth = 1000
L = [1]
while (depth > 0):
  L = [L]
  depth = depth - 1

for x in flatten (L):
  print (x)

# Bug in the question's code:
# the first value in the tuple is not completely flattened
# ([[[[[1]]]]], (0, 0, 0, ... ))

Worse, when depth increases to around 2000, the code provided in the question generates a runtime error, GeneratorExitException.

When using my program, it works for inputs of any size, nested to any depth, and always produces the correct output.

depth = 50000
L = [1]
while (depth > 0):
  L = [L]
  depth = depth - 1

print (flatten (L))
# (1, (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 49990 more...))

print (flatten (range (50000)))
# [ (0, (0,))
# , (1, (1,))
# , (2, (2,))
# , ...
# , (49999, (49999,))
# ]

Who would have such a deep list anyway? One common case is the linked list, which creates deep, tree-like structures:

my_list = [ 1, [ 2, [ 3, [ 4, None ] ] ] ]

Such a structure is common because the outermost pair gives us easy access to the two semantic parts we care about: the first item and the rest of the items. The linked list could be implemented using a tuple or a dict as well.

my_list = ( 1, ( 2, ( 3, ( 4, None ) ) ) )

my_list = { "first": 1
          , "rest": { "first": 2
                    , "rest": { "first": 3
                              , "rest": { "first": 4
                                        , "rest": None
                                        }
                              }
                    }
          }

Above, we can see that a sensible structure potentially creates a significant depth. In Python, [], (), and {} allow you to nest infinitely. Why should our generic flatten restrict that freedom?
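
(As an aside: the depth of such a first/rest structure is only an obstacle for recursive traversals. A minimal sketch, assuming the list form above, walks it with a plain loop in constant stack space:)

node = [1, [2, [3, [4, None]]]]
while node is not None:
    first, rest = node  # each cell is a (first, rest) pair
    print(first)        # prints 1, 2, 3, 4
    node = rest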

It's my opinion that if you're going to design a generic function like flatten, you should choose the implementation that works in the most cases and has the fewest surprises. One that suddenly fails just because a certain (deep) structure is used is a bad one. The flatten used in my answer is not the fastest[1], but it doesn't surprise the programmer with strange answers or program crashes.

[1] I don't measure performance until it matters, so I haven't done anything to tune flatten above. Another understated advantage of my program is that you can tune it because we wrote it. On the other hand, if for, enumerate and yield caused problems in your program, what would you do to "fix" it? How would we make it faster? How would we make it work for inputs of greater size or depth? What good is a Ferrari after it's wrapped around a tree?

Mulan
  • They just look way more complicated, less efficient, and don't even produce correct output. – Stefan Pochmann Feb 26 '18 at 22:14
  • it's subjectively "better" because it's stack-safe, not using `try`/`except` to control the flow of the program, and might teach you a thing or two about computation. cheers all – Mulan Feb 26 '18 at 23:00
  • What do you mean with "stack-safe"? Your first two can't even handle `flatten(list(range(1000)))` due to too deep recursion, even though it's a simple flat list. Your third can, but it looks ridiculously complicated. What's with try/except? EAFP is a "common Python coding style", says the official Python glossary. – Stefan Pochmann Feb 26 '18 at 23:10
  • @StefanPochmann thanks for giving me an opportunity to teach you about stack safety. I've made an edit to my answer :D – Mulan Feb 27 '18 at 05:52
  • Riiight. I totally didn't know that too deep recursion is bad. Sure. Your first two still can't even handle a long *totally flat* list, so they're not "stack-safe" even though there's no good reason not to be. And note that those are the ones I asked about; you hadn't included your third back then. – Stefan Pochmann Feb 27 '18 at 13:01
  • ... Your third solution, for your list of depth 50000, btw *crashes* on my PC with MemoryError and on repl.it with "repl process died unexpectedly". On ideone.com it gets "Time limit exceeded #stdin #stdout 5s 4849408KB". Not sure how they measure memory, can't believe they let you use 4.8 GB. Anyway, my own solution gets "Success #stdin #stdout 0.07s 13592KB" there for that list. – Stefan Pochmann Feb 27 '18 at 13:01
  • ... Is your *"Just because it appears more complex doesn't mean it is less efficient"* referring to my comment? It being complicated is *not* why I said it's less efficient. But it really is. Very much. I don't even have to come up with my own mean test case. I can just use your own `flatten(range(50000))`. On my PC, the OP's solution takes about 0.05 seconds for that and yours takes about 8.5 seconds. 170 times slower! And my own solution takes about 0.025 seconds for it. – Stefan Pochmann Feb 27 '18 at 13:02
  • ... Btw, I think the OP's solution not being able to handle such deep lists is rather irrelevant. One should not even have such a list in the first place. If one does, that's a problem that should be fixed instead. If the list can also go *wide*, as in the OP's case, then even with a branch factor of just 2 you'll run out of memory at a depth around 25. Nowhere near 50000. If the list *can't* go wide, then nested lists is the wrong data structure choice. If it *can* but *doesn't*, then likely the creation algorithm should be improved (for example if it represents a sorted tree, add balancing). – Stefan Pochmann Feb 27 '18 at 13:21
  • ... Another test: With depth 31000 yours doesn't crash on my PC, so I can compare it with mine there. Yours takes about 2.8 seconds and mine takes about 0.021 seconds, so yours is about 133 times slower there. And really, 2.8 seconds to do 31000 steps is just unreasonably slow. And the 8.5 seconds for 50000 steps is even worse. I'm not running this on a pocket calculator but on an i7-6700 desktop PC. – Stefan Pochmann Feb 27 '18 at 14:12
  • ... Hmm, [ideone's FAQ](https://ideone.com/faq) does say memory usage is constrained to 256 MB (which is about what I had expected), but I tested a [simple program](https://ideone.com/UGMpFD) that uses several GB similar to how yours does, and ideone was totally fine with it. So apparently their FAQ is wrong, and I now think the 4.8 GB reported earlier is indeed what your program used there at some point. – Stefan Pochmann Feb 27 '18 at 14:46
  • Wow, thanks for the extensive albeit snarky performance review! I made a final edit to address some of your comments – And good job on your solution; now readers can see how to do it using functional-style and recursion or how to do it using imperative-style and loops. I hope everyone gets to learn something from our efforts. – Mulan Feb 27 '18 at 15:52
  • Yeah, snark is what I do when I'm somewhat annoyed :-P. But while I think neither of our answers are really providing what the OP wants or needs, your third solution is certainly interesting. Otherwise I wouldn't have spent that much effort on this :-). This might be the first time I've seen someone use trampolines to solve a given problem, I only knew them from [this blog post](http://www.usrsb.in/blog/blog/2012/08/12/bouncing-pythons-generators-with-a-trampoline/). I'll check out your update later, no more time now. – Stefan Pochmann Feb 27 '18 at 16:19
  • Your latest update broke it: https://eval.in/964636 (see the output there). About your added thoughts: Even if you do build a linked list like that (which I'd say is pretty *uncommon* and you'd probably better use a `list` or a `deque`), will you really be interested in the index paths? They'll just be long tuples completely full of `1` followed by a single `0`. Not exactly interesting or useful or efficient. And trivial to do better: https://eval.in/964643. About `for`, `enumerate` and `yield` causing problems: They don't. Misusing them might. What to fix then? That misuse. – Stefan Pochmann Feb 28 '18 at 20:27
  • One more thought, about your argument that *generic* functions are bad if they can't handle this: I'd say Python's own `str` and `repr` are about as generic as it gets. They handle way more than just nested lists, they handle all kinds of data. And yet, both of them crash for a list of depth 1000, due to too deep recursion. Even `copy.deepcopy(L)` crashes, despite deep data being its whole point! It's in its name! Even `L == L[0]` crashes. So you're saying `str` and `repr` and `deepcopy` and `==` are all bad? I'd rather say it's evidence that such deep lists are indeed unpythonic. – Stefan Pochmann Feb 28 '18 at 20:52
  • I think to fix your code, you just need to change `(cont, left + right)` to `call(cont, left + right)`. Btw, do you avoid `acc.append(...)` intentionally? If you used that, your code could become simpler and more efficient. – Stefan Pochmann Mar 01 '18 at 20:00
  • Yes `cont (left + right)` is sufficient here, but I originally used `call (cont, left + right)`. It must’ve been mangled in an edit. Thanks again for your attention to detail :D – Mulan Mar 01 '18 at 20:17
  • What about the `acc.append`? If you do that, then you can just pass the same `acc` object around the whole time, no need to restart with a fresh `[]` and no need for `left + right`. Less code and more efficient. You could then also define `acc` in an outer scope, then you don't even need to pass it around. Did those variations here (and in the end also a variation where I eliminated the `else` blocks): https://eval.in/965405 – Stefan Pochmann Mar 01 '18 at 20:41
  • ... oops, when I switched to `acc` in outer scope, I forgot that I don't need to pass it to `cont` anymore, so this is what I actually wanted to end up with: https://eval.in/965412 – Stefan Pochmann Mar 01 '18 at 20:53
  • Stefan, I appreciate your thorough examination of my answer. When writing functional style, mutations like `append` and other side effects are avoided so the programmer can maintain *equational reasoning*. I understand it's a personal choice to trade speed or efficiency to maintain that property in a language like Python. The design of my program is influenced by monads, specifically the continuation monad. You can see what it might look like using `Cont` here: https://eval.in/965502 – Mulan Mar 01 '18 at 22:15
  • The imperative style used by your well-written program is clearly favoured by Python and the Python community. Python has all sorts of optimizations for the features you're using, whereas mine uses `lambda` which [even the creator sees as a problem](https://mail.python.org/pipermail/python-dev/2006-February/060415.html). That said, functional-style isn't *incorrect* in Python, it's just unconventional. My goal here on SO is to talk about computation in general and less concerned with optimization in a specific language. – Mulan Mar 01 '18 at 22:25
  • Yes, avoiding side effects is what I suspected when I asked whether you avoid `append` intentionally. Those terms you mentioned appear in the Haskell book I started reading, hope to get back to that soon :-). That `Cont` solution btw can't handle deep data anymore, crashes for depths ~990+. And I think you can give `left` to the second call instead of `[]` and then don't need `left + right` later: https://eval.in/965875 Why not do that? Are you sure Guido *sees* lambda as a problem? That was in 2006. (I always wonder when people show those old posts.) – Stefan Pochmann Mar 02 '18 at 13:20
  • Hey Stefan, the bug is on line 30 where `loop` eagerly recurs – not actually sure how to adapt this particular program without making some pretty big changes! – Mulan Mar 04 '18 at 05:29
  • I'm not a Python expert, but I did come up with some simplified examples to verify `Cont` behaves correctly. Here, `flatten` is different than the one asked about in the question, but the example is still helpful to see how the Cont sequencing works. I thought you might enjoy them https://eval.in/966384 – Mulan Mar 04 '18 at 05:38

Recursion is a good approach for flattening deeply nested lists. Your implementation is also well done. I would suggest modifying it along the lines of this similar recipe:

Code

from collections.abc import Iterable  # moved here from collections; required on Python 3.10+


def indexed_flatten(items):
    """Yield items from any nested iterable."""
    for i, item in enumerate(items):
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            for item_, idx in indexed_flatten(item):
                yield item_, (i,) + idx
        else:
            yield item, (i,)


lst = [[[1, 2, 3], [4, 5]], [6], [7, [8, 9]], 10]
list(indexed_flatten(lst))

Output

[(1, (0, 0, 0)),
 (2, (0, 0, 1)),
 (3, (0, 0, 2)),
 (4, (0, 1, 0)),
 (5, (0, 1, 1)),
 (6, (1, 0)),
 (7, (2, 0)),
 (8, (2, 1, 0)),
 (9, (2, 1, 1)),
 (10, (3,))]

This robustly works with many item types, e.g. [[[1, 2, 3], {4, 5}], [6], (7, [8, "9"]), 10]. One caveat: for unordered containers such as sets, the yielded indices reflect iteration order and can't be used to subscript back into the structure, since sets don't support indexing.

pylang