
Is it possible to use yield inside the map function?

For POC purposes, I have created a sample snippet.

# Python 3  (Win10)
from concurrent.futures import ThreadPoolExecutor
import os
def read_sample(sample):
    with open(os.path.join('samples', sample)) as fff:
        for _ in range(10):
            yield str(fff.read())

def main():
    with ThreadPoolExecutor(10) as exc:
        files = os.listdir('samples')
        files = list(exc.map(read_sample, files))
        print(str(len(files)), end="\r")

if __name__=="__main__":
    main()

I have 100 files in the samples folder. As per the snippet, 100*10=1000 should be printed. However, it prints 100 only. When I checked, each element was just a generator object.

What change is needed to make it print 1000?

Mehul Gupta
Jay Joshi
  • It is possible to use a generator (a "`yield` function") inside `map`, but as you have observed, this will just instantiate that generator. Is there a reason why `read_sample` does not just produce a list? What are you trying to achieve by using generators? Note that you can get the results by using `list(itertools.chain(*exc.map(read_sample, files)))` instead, but it will benefit from neither threads nor generators. – MisterMiyagi May 24 '20 at 08:29
  • does this help https://stackoverflow.com/questions/44708312/how-to-use-a-generator-as-an-iterable-with-multiprocessing-map-function ? – Nikos M. May 24 '20 at 08:29
  • 1
    I don't understand why you expect 1000. if original `files` is list with 100 names then result is also list with 100 elements and you print `len()` which means number of elements on list, not summary size of all elements (which could be 1000 - like `sum(len(x) for x in files)`) – furas May 24 '20 at 09:00
  • 2
    I guess what you actually want is to have `map` peek inside the generator, something like an non-existent `map_from` or something? – norok2 May 24 '20 at 09:50
  • As @norok2 suggested, you would need something like `map_from` to run the generator 10 times for every filename. A normal `map` will run the function/generator only once per filename. BTW: using `read()` reads all the data from the file on the first execution, so subsequent executions produce only empty results - which seems useless. You would have to use e.g. `read(5)` to read only part of the file. – furas May 24 '20 at 10:28
  • Thank you for your replies. This is a snippet I created for POC only. In my product, I have a list of 100 `A` objects, each of which has a regular expression. Each of those `A` objects has to generate 10 `B` objects, hence the yielding. I am trying to make it multithreaded using the `map` function, as these are all file operations. If this POC is successful, I will apply the same concept in my product. However, I believe there should be a way in Python to achieve this, regardless of the business logic. – Jay Joshi May 24 '20 at 11:50
  • Can you please clarify why you want to use generators for this? Generators are inherently cooperative concurrency, which conflicts with using threads to achieve preemptive concurrency. – MisterMiyagi May 24 '20 at 12:42
  • As I mentioned, I have a function which returns a list (a list using return, or a generator using yield), and I want to call that function 100 times for 100 files. For that I am using `map`. I am not bound to use a generator, but the same problem exists when returning a list as well: at the end I get a list of lists, which has to be flattened before use. – Jay Joshi May 25 '20 at 04:38
  • Does this answer your question: [How to make a flat list out of list of lists?](https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists?r=SearchResults) – MisterMiyagi May 25 '20 at 05:09

1 Answer


You can use `map()` with a generator function, but it will just map over the generator objects it returns; it will not descend into the generators themselves.

A possible approach is to have a single generator do the looping the way you want, and have a function operate on the yielded objects. This has the added advantage of separating the looping from the computation more neatly. So, something like this should work:

  • Approach #1
# Python 3  (Win10)
from concurrent.futures import ThreadPoolExecutor
import os
def read_samples(samples):
    for sample in samples:
        with open(os.path.join('samples', sample)) as fff:
            for _ in range(10):
                yield fff

def main():
    with ThreadPoolExecutor(10) as exc:
        files = os.listdir('samples')
        files = list(exc.map(lambda x: str(x.read()), read_samples(files)))
        print(str(len(files)), end="\r")

if __name__=="__main__":
    main()

Another approach is to nest an extra map call to consume the generators:

  • Approach #2
# Python 3  (Win10)
from concurrent.futures import ThreadPoolExecutor
import os
def read_sample(sample):
    with open(os.path.join('samples', sample)) as fff:
        for _ in range(10):
            yield str(fff.read())

def main():
    with ThreadPoolExecutor(10) as exc:
        files = os.listdir('samples')
        files = exc.map(list, exc.map(read_sample, files))
        files = [f for fs in files for f in fs]  # flattening the results
        print(str(len(files)), end="\r")

if __name__=="__main__":
    main()

A more minimal example

Just to get to a more reproducible example, the essence of your code can be written as a more minimal example (one that does not rely on files lying around on your system):

from concurrent.futures import ThreadPoolExecutor


def foo(n):
    for i in range(n):
        yield i


with ThreadPoolExecutor(10) as exc:
    k = 8
    x = list(exc.map(foo, range(k)))
    print(x)
# [<generator object foo at 0x7f1a853d4518>, <generator object foo at 0x7f1a852e9990>, <generator object foo at 0x7f1a852e9db0>, <generator object foo at 0x7f1a852e9a40>, <generator object foo at 0x7f1a852e9830>, <generator object foo at 0x7f1a852e98e0>, <generator object foo at 0x7f1a852e9fc0>, <generator object foo at 0x7f1a852e9e60>]
  • Approach #1:
from concurrent.futures import ThreadPoolExecutor


def foos(ns):
    for n in range(ns):
        for i in range(n):
            yield i


with ThreadPoolExecutor(10) as exc:
    k = 8
    x = list(exc.map(lambda x: x ** 2, foos(k)))
    print(x)
# [0, 0, 1, 0, 1, 4, 0, 1, 4, 9, 0, 1, 4, 9, 16, 0, 1, 4, 9, 16, 25, 0, 1, 4, 9, 16, 25, 36]
  • Approach #2
from concurrent.futures import ThreadPoolExecutor


def foo(n):
    for i in range(n):
        yield i ** 2


with ThreadPoolExecutor(10) as exc:
    k = 8
    x = exc.map(list, exc.map(foo, range(k)))
    print([z for y in x for z in y])
# [0, 0, 1, 0, 1, 4, 0, 1, 4, 9, 0, 1, 4, 9, 16, 0, 1, 4, 9, 16, 25, 0, 1, 4, 9, 16, 25, 36]
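A third option, hinted at in the comments, is to flatten with `itertools.chain.from_iterable`. Note that the generators are then consumed in the calling thread, so the per-item work itself does not run in the pool - this is only worthwhile when creating the generators is the cheap part:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def foo(n):
    for i in range(n):
        yield i ** 2


with ThreadPoolExecutor(10) as exc:
    k = 8
    # exc.map returns one generator per input;
    # chain.from_iterable lazily flattens them into a single stream
    x = list(chain.from_iterable(exc.map(foo, range(k))))
    print(x)
# [0, 0, 1, 0, 1, 4, 0, 1, 4, 9, 0, 1, 4, 9, 16, 0, 1, 4, 9, 16, 25, 0, 1, 4, 9, 16, 25, 36]
```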
Neuron
norok2
  • I was wondering, could you also do it with `.submit()` instead of `.map()`? Does the same logic apply there as well? – Lee Sai Mun Jun 09 '21 at 01:02
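For what it's worth, a possible sketch of the same idea with `.submit()` (reusing the illustrative `foo` generator from the minimal examples): each `submit` call returns a `Future`, and the generator still has to be consumed explicitly - here by handing `list` and the generator to `submit`, so the iteration happens in a worker thread:

```python
from concurrent.futures import ThreadPoolExecutor


def foo(n):
    for i in range(n):
        yield i ** 2


with ThreadPoolExecutor(10) as exc:
    k = 8
    # submit one task per input; list() consumes each generator
    # inside a worker thread
    futures = [exc.submit(list, foo(n)) for n in range(k)]
    # gather and flatten the results in input order
    x = [z for f in futures for z in f.result()]
    print(x)
# [0, 0, 1, 0, 1, 4, 0, 1, 4, 9, 0, 1, 4, 9, 16, 0, 1, 4, 9, 16, 25, 0, 1, 4, 9, 16, 25, 36]
```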