63

Consider this scenario:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

walk = os.walk('/home')

for root, dirs, files in walk:
    for pathname in dirs+files:
        print os.path.join(root, pathname)

for root, dirs, files in walk:
    for pathname in dirs+files:
        print os.path.join(root, pathname)

I know this example is somewhat redundant, but consider that we need to use the same walk data more than once. I have a benchmark scenario, and using the same walk data is mandatory to get useful results.

I've tried `walk2 = walk` to clone it and use it in the second iteration, but it didn't work. The question is... How can I copy it? Is it even possible?

Thank you in advance.

Stefan
Paulo Freitas
  • What's wrong with using `os.walk('/home')` twice? How is that a problem? – S.Lott Feb 09 '11 at 13:04
  • @S.Lott Well, that kind of task varies so much on each run. Another problem is that after the first run the system will probably cache the results, so on the next runs we'll get imprecise results. The idea is to walk once beforehand and then measure two scenarios, passing the walk data as an argument. :) – Paulo Freitas Feb 09 '11 at 13:16
  • Caching won't cause false results. – Sven Marnach Feb 09 '11 at 13:20
  • @pf.me: How can using `os.walk('/home')` twice be any different than the code you're trying to write where you "clone" the generator? What's wrong with writing the code two times? – S.Lott Feb 09 '11 at 13:23
  • @S.Lott While running `os.walk()` inside the methods I'm measuring, I noticed that on subsequently runs I get randomly results with seconds of difference. Then I'm aiming to measure what comes after the walk passing its data as argument. – Paulo Freitas Feb 09 '11 at 13:42
  • @pf.me: If you are doing profiling on the following operation, then you should definitely unroll the generator to a list in order to eliminate the variations in directory crawling (see my answer below). However, if the directory structure you are walking is very large, you might still get variation because of memory paging. – shang Feb 09 '11 at 14:19
  • @pf.me: "I noticed that on subsequently runs I get randomly results with seconds of difference." How does "cloning" the `os.walk('/home')` generator fix that? – S.Lott Feb 09 '11 at 15:03

6 Answers

84

You can use itertools.tee():

walk, walk2 = itertools.tee(walk)

Note that this might "need significant extra storage", as the documentation points out.
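A minimal sketch of how `tee()` behaves, using a trivial generator in place of `os.walk()` so the output is predictable:

```python
import itertools

# A trivial generator standing in for os.walk().
def numbers():
    for i in range(3):
        yield i

walk1, walk2 = itertools.tee(numbers())

first = list(walk1)   # exhausts the first copy
second = list(walk2)  # the second copy still yields all items
print(first, second)  # [0, 1, 2] [0, 1, 2]
```

Internally, `tee()` buffers every item one iterator has consumed but the other has not, which is where the extra storage comes from.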

Sven Marnach
  • also, the [documentation](http://docs.python.org/2/library/itertools.html#itertools.tee) says: "In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use `list()` instead of `tee()`." Given the OP's original code snippet iterates through once completely, and then again, wouldn't it be recommended for him to use `list()`? – HorseloverFat Oct 21 '13 at 16:16
  • Use a cached generator instead, for example with `lambda: a_new_generator`, as described [here](http://stackoverflow.com/a/21315536/1959808). – 0 _ Dec 07 '14 at 00:39
  • See also the comments to [this answer](http://stackoverflow.com/a/1271481/1959808). – 0 _ Jan 19 '15 at 00:02
  • why do i see so many people saying there is no way to clone generators in python?? – Ishan Srivastava Apr 11 '18 at 09:43
  • @IshanSrivastava This doesn't actually clone the generator object. It just creates a new iterator yielding the same values, but the new objects aren't generators anymore. – Sven Marnach Apr 11 '18 at 12:40
  • No dude, this doesn't copy the generator, it transforms it into an iterator... which is not a generator. Suppose I have a generator that fetches data partially and sequentially from a SQL table that contains 6 billion rows... If I use `itertools.tee` I explode my RAM. – Imad Mar 22 '19 at 14:06
17

If you know you are going to iterate through the whole generator for every usage, you will probably get the best performance by unrolling the generator to a list and using the list multiple times.

walk = list(os.walk('/home'))
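A list can be traversed any number of times, so both loops from the question then work unchanged. A small sanity check, walking the current directory to keep it cheap:

```python
import os

# Unroll the generator once; the resulting list is reusable.
walk = list(os.walk('.'))

first_pass = [root for root, dirs, files in walk]
second_pass = [root for root, dirs, files in walk]
print(first_pass == second_pass)  # True: both passes see identical data
```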

shang
  • Just out of curiosity, why does the necessity of iterating over every object in a generator make it more efficient to save a value-map in memory before iteration? – Rob Truxal Jul 16 '18 at 17:14
6

Define a function

def walk_home():
    for r in os.walk('/home'):
        yield r

Or even this

def walk_home():
    return os.walk('/home')

Both are used like this:

for root, dirs, files in walk_home():
    for pathname in dirs+files:
        print os.path.join(root, pathname)
S.Lott
  • While not the answer to the exact question the OP asked, this is a good way to do it without storing the complete directory tree in memory. +1 – Sven Marnach Feb 09 '11 at 13:09
  • @Sven Marnach: The "exact" question makes little sense. – S.Lott Feb 09 '11 at 13:23
  • Would you say that defining a function is "better than" `itertools.tee()` in the aspect of the ["[…] significant auxiliary storage […]"](https://docs.python.org/3/library/itertools.html#itertools.tee) mentioned there? – Wolf May 08 '20 at 17:13
5

This is a good use case for functools.partial() to make a quick generator factory:

from functools import partial
import os

walk_factory = partial(os.walk, '/home')

walk1, walk2, walk3 = walk_factory(), walk_factory(), walk_factory()

What functools.partial() does is hard to describe in words, but this is exactly what it's for.

It partially fills out function-params without executing that function. Consequently it acts as a function/generator factory.
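A hedged sketch with a toy generator in place of `os.walk()`, showing that each call to the partial yields a fresh, independent generator:

```python
from functools import partial

def count_up_to(n):
    # Stand-in for os.walk(): any generator function works here.
    for i in range(n):
        yield i

factory = partial(count_up_to, 3)  # binds n=3 without calling yet

first = list(factory())   # a brand-new generator
second = list(factory())  # another brand-new generator
print(first, second)      # [0, 1, 2] [0, 1, 2]
```

Note that with `os.walk()` this reruns the directory scan on every call, so each pass sees fresh data rather than a snapshot.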

Rob Truxal
3

This answer aims to extend/elaborate on what the other answers have expressed. The solution will necessarily vary depending on what exactly you aim to achieve.

If you want to iterate over the exact same result of os.walk multiple times, you will need to initialize a list from the os.walk iterable's items (i.e. walk = list(os.walk(path))).

If you must guarantee the data remains the same, that is probably your only option. However, there are several scenarios in which this is not possible or desirable.

  1. It will not be possible to list() an iterable if the output is of sufficient size (e.g. attempting to list() an entire filesystem may freeze your computer).
  2. It is not desirable to list() an iterable if you wish to acquire "fresh" data prior to each use.

In the event that list() is not suitable, you will need to run your generator on demand. Note that generators are exhausted after each use, which poses a slight problem. In order to "rerun" your generator multiple times, you can use the following pattern:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

class WalkMaker:
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        for root, dirs, files in os.walk(self.path):
            for pathname in dirs + files:
                yield os.path.join(root, pathname)

walk = WalkMaker('/home')

for path in walk:
    pass

# do something...

for path in walk:
    pass

The aforementioned design pattern will allow you to keep your code DRY.

Six
0

This "Python Generator Listeners" code allows you to have many listeners on a single generator, like os.walk, and even have someone "chime in" later.

def walkme():
    return os.walk('/home')

m1 = Muxer(walkme)
m2 = Muxer(walkme)

then m1 and m2 can run in threads even and process at their leisure.

See: https://gist.github.com/earonesty/cafa4626a2def6766acf5098331157b3

import queue
from threading import Lock
from collections import namedtuple

class Muxer():
    Entry = namedtuple('Entry', 'genref listeners lock')

    already = {}
    top_lock = Lock()

    def __init__(self, func, restart=False):
        self.restart = restart
        self.func = func
        self.queue = queue.Queue()

        with self.top_lock:
            if func not in self.already:
                self.already[func] = self.Entry([func()], [], Lock())
            ent = self.already[func]

        self.genref = ent.genref
        self.lock = ent.lock
        self.listeners = ent.listeners

        self.listeners.append(self)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            e = self.queue.get_nowait()
        except queue.Empty:
            with self.lock:
                try:
                    e = self.queue.get_nowait()
                except queue.Empty:
                    try:
                        e = next(self.genref[0])
                        for other in self.listeners:
                            if not other is self:
                                other.queue.put(e)
                    except StopIteration:
                        if self.restart:
                            self.genref[0] = self.func()
                        raise
        return e

    def __del__(self):
        with self.top_lock:
            try:
                self.listeners.remove(self)
            except ValueError:
                pass
            if not self.listeners and self.func in self.already:
                del self.already[self.func]
Erik Aronesty