8

I really like Python generators. In particular, I find they are just the right tool for connecting to REST endpoints - my client code only has to iterate on the generator that is connected to the endpoint. However, I am finding one area where Python's generators are not as expressive as I would like. Typically, I need to filter the data I get out of the endpoint. In my current code, I pass a predicate function to the generator; it applies the predicate to the data it is handling and only yields data if the predicate is True.
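For context, here is a minimal sketch of that predicate-passing approach (the names and data are illustrative, not my real endpoint code):

```python
# Illustrative sketch: the generator takes the predicate as an argument
# and applies it internally, which is what I want to move away from.
def datasource(predicate):
    # In real code this would iterate over dictionaries parsed
    # from a REST endpoint's JSON responses.
    data = ["sanctuary", "movement", "liberty", "seminar",
            "formula", "short-circuit", "generate", "comedy"]
    for d in data:
        if predicate(d):
            yield d

for w in datasource(lambda d: len(d) < 8):
    print(w)  # liberty, seminar, formula, comedy
```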

I would like to move toward composition of generators - like data_filter(datasource()). Here is some demonstration code that shows what I have tried. It is pretty clear why it does not work; what I am trying to figure out is the most expressive way of arriving at a solution:

# Mock of REST endpoint: in actual code, the generator is
# connected to a REST endpoint which returns dictionaries (from JSON).
def mock_datasource():
    mock_data = ["sanctuary", "movement", "liberty", "seminar",
                 "formula", "short-circuit", "generate", "comedy"]
    for d in mock_data:
        yield d

# Mock of a filter: a simplification; in reality I am filtering on some
# aspect of the data, like data['type'] == "external"
def data_filter(d):
    if len(d) < 8:
        yield d

# First Try:
# for w in data_filter(mock_datasource()):
#     print(w)
# >> TypeError: object of type 'generator' has no len()

# Second Try 
# for w in (data_filter(d) for d in mock_datasource()):
#     print(w)
# I don't get words out, 
# rather <generator object data_filter at 0x101106a40>

# Using a predicate to filter works, but is not the expressive 
# composition I am after
for w in (d for d in mock_datasource() if len(d) < 8):
    print(w)
chladni
  • How do you feel about the built-in `filter()`? – Kevin Jan 12 '18 at 19:15
  • Good suggestion - if I use a predicate function I write filter(data_predicate, mock_datasource()). However, I do prefer the approach where I can write the generate composition like f(g(x)) – chladni Jan 12 '18 at 19:40
  • @Kevin in that case `filter` calls for a `lambda`, and now you have a clunky expression. `filter` is good when the filtering function already exists (like `str.isdigit`, or `None` to test truth values). – Jean-François Fabre Jan 12 '18 at 19:51
  • @Jean-FrançoisFabre, agreed, `filter` is a "sometimes" solution. Which is why I didn't go to the effort to build a full-fledged answer around it :-P – Kevin Jan 12 '18 at 20:11
  • `filter` was _very_ useful on strings in Python 2 because it saved the need for `str.join`. Now the joy is gone :) – Jean-François Fabre Jan 12 '18 at 20:12

5 Answers

4

data_filter should apply len to the elements of d, not to d itself, like this:

def data_filter(d):
    for x in d:
        if len(x) < 8:
            yield x

now your code:

for w in data_filter(mock_datasource()):
    print(w)

prints

liberty
seminar
formula
comedy
Jean-François Fabre
  • Thanks, this seems to get me the closest to what I asked for. That being said, I wonder if composing generators entails a performance cost that I did not consider. – chladni Jan 12 '18 at 19:57
  • That's true: the more you chain function/generator calls, the slower your application will be. Calling a function in Python is more expensive than in compiled languages, partly because compiled languages have the ability to _inline_ some calls. – Jean-François Fabre Jan 12 '18 at 20:00
  • So far, in tests comparing the execution time of filtering with predicates vs filtering with composed generators (i.e. based on your answer), I am not seeing a huge performance penalty with the composition approach. As is often the case, I need to run more tests. "The first principle is that you must not fool yourself - and you are the easiest person to fool." - Richard Feynman – chladni Jan 13 '18 at 03:30
  • That's true. You must benchmark your various approaches with a relevant data size (size & contents). – Jean-François Fabre Jan 13 '18 at 09:19
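A rough way to run the benchmark discussed in the comments above (a timeit sketch under assumed conditions; the multiplier, repeat count, and timings are arbitrary and will vary by machine and data):

```python
import timeit

# Assumed test data: the question's mock words, repeated to get measurable times.
words = ["sanctuary", "movement", "liberty", "seminar",
         "formula", "short-circuit", "generate", "comedy"] * 1000

def layered(data):
    # Extra generator stage, as in the composed approach.
    return (x for x in data if len(x) < 8)

# Inline predicate expression vs. the extra generator layer.
t_inline = timeit.timeit(lambda: list(x for x in words if len(x) < 8), number=200)
t_layered = timeit.timeit(lambda: list(layered(words)), number=200)
print(f"inline: {t_inline:.4f}s, layered: {t_layered:.4f}s")
```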
1

More concisely, you can do this with a generator expression directly:

def length_filter(d, minlen=0, maxlen=8):
    return (x for x in d if minlen <= len(x) < maxlen)

Apply the filter to your generator just like a regular function:

for element in length_filter(endpoint_data()):
    ...

If your predicate is really simple, the built-in function `filter` may also meet your needs.
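For example (a sketch using the built-in `filter` with the question's mock data in place of a real endpoint):

```python
words = ["sanctuary", "movement", "liberty", "seminar",
         "formula", "short-circuit", "generate", "comedy"]

# filter() returns a lazy iterator, so it chains like a generator
for w in filter(lambda d: len(d) < 8, words):
    print(w)  # liberty, seminar, formula, comedy
```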

wim
0

You could pass a filter function that you apply to each item:

def mock_datasource(filter_function):
    mock_data = ["sanctuary", "movement", "liberty", "seminar",
                 "formula", "short-circuit", "generate", "comedy"]
    for d in mock_data:
        yield filter_function(d)

def filter_function(d):
    # filter
    return filtered_data
meow
  • Right - the approach you suggest is similar to the working code I am using. I am trying to put the filter at the output end of the datasource - I would like to lift the filter completely out of the generator's code. The closest I have come to that is the use of a predicate in the final example I gave. In any case, thanks for the advice! – chladni Jan 12 '18 at 19:25
0

What I would do is define filter(data_filter) to receive a generator as input and return a generator whose values are filtered by the data_filter predicate (a regular predicate, not aware of the generator interface).

The code is:

def filter(pred):
    """Filter, for composition with generators that take coll as an argument."""
    def generator(coll):
        for x in coll:
            if pred(x):
                yield x
    return generator

def mock_datasource ():
    mock_data = ["sanctuary", "movement", "liberty", "seminar",
                 "formula","short-circuit", "generate", "comedy"]
    for d in mock_data:
        yield d

def data_filter(d):
    return len(d) < 8


gen1 = mock_datasource()
filtering = filter(data_filter)
gen2 = filtering(gen1) # or filter(data_filter)(mock_datasource())

print(list(gen2)) 

If you want to go further, you may use compose, which I think was the whole intent:

from functools import reduce

def compose(*fns):
    """Compose functions left to right - allows generators to compose with same
    order as Clojure style transducers in first argument to transduce."""
    return reduce(lambda f,g: lambda *x, **kw: g(f(*x, **kw)), fns)

gen_factory = compose(mock_datasource, 
                      filter(data_filter))
gen = gen_factory()

print(list(gen))

PS: I used some code found here, where the Clojure guys expressed composition of generators, inspired by the way they do composition generically with transducers.

PS2: filter may be written in a more Pythonic way:

def filter(pred):
    """Filter, for composition with generators that take coll as an argument."""
    return lambda coll: (x for x in coll if pred(x))
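A quick, self-contained usage sketch of this lambda version (with the question's mock data standing in for the endpoint):

```python
def filter(pred):
    """Filter, for composition with generators that take coll as an argument."""
    return lambda coll: (x for x in coll if pred(x))

words = ["sanctuary", "movement", "liberty", "seminar",
         "formula", "short-circuit", "generate", "comedy"]

# filter(pred) returns a generator transformer: iterable in, iterable out
short_words = filter(lambda d: len(d) < 8)
print(list(short_words(words)))  # ['liberty', 'seminar', 'formula', 'comedy']
```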
dqc
0

Here is a function I have been using to compose generators together.

from functools import reduce

def compose(*funcs):
    """ Compose generators together to make a pipeline.
    e.g.
        pipe = compose(func1, func2, func3)
        result = pipe(range(0, 5))
    """
    return lambda x: reduce(lambda f, g: g(f), funcs, x)

Where funcs is a sequence of generator functions, each taking an iterable. In your example, mock_datasource takes no arguments, so it is the source you feed into the pipeline rather than a stage, and data_filter must be the version that iterates over its input:

pipe = compose(data_filter)  # add further stages as needed
print(list(pipe(mock_datasource())))

This is not original.

CpILL