
I'm a bit new to functional programming in Python...

scenario

Here's the general problem. Suppose I have an input that I want to read through 1 time only. Let's say it is really big. Maybe I have a lot of filters, transformations, reductions, whatever to do on this stream. At the end, I want to produce a list of modest size and hand it off to something else as the result of my analysis.

If I want to create a single list, I'm in good shape. I will encode the above logic as operations on an iterable and chain them through a pipeline of tools like filter(), etc. The final iterable I hand to a list comprehension, which will efficiently build the resulting list.

But, what if my modest requirement is that I want two lists as output? For instance, I want all the 'falses' (of some question) in one list and all the 'trues' in another list. Or maybe I want three lists...

inadequate solutions

In this case, I take it I have two options:

  1. iterate my input to generate the first list, save that output and then iterate again to produce the second output list
  2. create two empty lists, manually iterate my pipeline's output, append to my lists according to my needs at each step

Both of those options stink in comparison with the list comprehension. The first makes multiple passes, and the second calls append() repeatedly, which (I guess I assume) is slow, and it's a construct living in arbitrary Python instead of a clean, optimizable single statement.
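To be concrete about option 2, here is a minimal sketch of the single-pass, append-based split (the function name and predicate are made up for illustration):

```python
def partition_one_pass(stream, predicate):
    """Split one iterable into two lists in a single pass (option 2 above)."""
    trues, falses = [], []
    for item in stream:
        # route each item to exactly one list; the stream is consumed once
        (trues if predicate(item) else falses).append(item)
    return trues, falses

# toy usage: split 0..9 into evens and odds
evens, odds = partition_one_pass(range(10), lambda n: n % 2 == 0)
# evens == [0, 2, 4, 6, 8], odds == [1, 3, 5, 7, 9]
```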

existing modules?

I've looked through the itertools and collections modules and peeked a bit at numpy. I see some things that might do the above, but their documentation explains that they are convenience functions and will result in buffering, etc., so they don't meet my requirements.
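For example, the `partition` recipe in the itertools documentation is built on `tee()`, and the `tee()` docs note that it buffers whatever one branch has consumed ahead of the other — so draining one output before the other stores essentially the whole stream in memory:

```python
from itertools import tee, filterfalse

def partition(pred, iterable):
    # the recipe from the itertools documentation
    t1, t2 = tee(iterable)
    return filterfalse(pred, t1), filter(pred, t2)

odds, evens = partition(lambda n: n % 2 == 0, range(10))
# Draining evens first forces tee() to buffer every odd number internally.
evens_list = list(evens)
odds_list = list(odds)
```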

I love Python functional style, iterators and generators. I feel I've got a decent understanding of the benefits of iterators, even as they relate to inputs that are not files. I appreciate the subtle difficulties (e.g. buffering) that might arise from reading from multiple iterators simultaneously, when some may be 'slow inputs' and others 'fast inputs'.

In my case, I just want to consume 1 iterator. This sort of situation has arisen for me multiple times in the last few years.

Concluding with an example

# python 3
# Toy example. Just for reading, not worth running
import random
import itertools


num_samples = 1000000
least_favorite_number = 98


def source(count):
    for _ in range(count):
        yield random.randint(1, 100)


def my_functional_process(stream):
    """ Do silly things to an input iterable of ints, return an iterable of int pairs"""
    # Remove the hated number
    stream = itertools.filterfalse(lambda x: x == least_favorite_number, stream)

    # For each number, take note of which number preceded it in the stream
    def note_ancestor(items):
        prev = None
        for x in items:
            yield x, prev
            prev = x

    stream = note_ancestor(stream)

    # I don't like it even when you and your ancestor add up to our
    # least favorite number or if you have no ancestor
    stream = itertools.filterfalse(
        lambda x: x[1] is None or x[0] + x[1] == least_favorite_number,
        stream
    )

    # Good job
    return stream


def single_pass_the_slow_way():
    """
    Read through the iterator in a single pass, but build result in a way that I think is slow
    """
    the_fours = []
    not_fours = []

    stream = source(num_samples)
    processed = my_functional_process(stream)

    for x in processed:
        if x[0] == 4:
            the_fours.append(x)
        else:
            not_fours.append(x)

    return the_fours, not_fours


def single_pass_and_fast():
    """
    In this function, we make a single pass but create multiple lists using
    imaginary syntax.
    """
    stream = source(num_samples)
    processed = my_functional_process(stream)

    # In my dream, Python figures out to run these comprehensions in parallel
    # In reality, is there even a syntax to represent this?? Obviously, the
    # below does not do it
    not_real_code = [
        # just making up syntax here
        #        [x for x in ~x~ if x == 4],
        #        [x for x in ~x~ if x != 4]
        x for x in processed
    ]

    # These should be a list of fours, and all others respectively
    return not_real_code[0], not_real_code[1]


i_want_it = 'slow'

if i_want_it == 'slow':
    fours, others = single_pass_the_slow_way()
    print("We're done. ready to use those lists")
else:
    fours, others = single_pass_and_fast()
    print("We're done a bit faster. ready to use those lists")
Bill Huneke
  • No, `.append` is **not slow** and is the "right" way of doing this in Python. Indeed, list comprehensions use `.append` under the hood. `single_pass_the_slow_way` is the way to go, albeit it isn't purely functional, but Python doesn't really support efficient, purely functional programming. It is at its core an imperative language. – juanpa.arrivillaga Nov 26 '19 at 21:50
  • Thank you, perhaps you are right. But as this interesting conversation points out, resolving 'append' and then calling it over and over does have a cost. https://stackoverflow.com/questions/30245397/why-is-a-list-comprehension-so-much-faster-than-appending-to-a-list – Bill Huneke Nov 26 '19 at 22:01
  • That's a separate issue, though. Appending to a linked list (without a dedicated tail pointer) is slow, but Python lists aren't linked lists. A Python list is an over allocated array that will grow as necessary. – chepner Nov 26 '19 at 22:02
  • Yes, that is one of the well-known optimizations of list comprehensions: they don't need to continuously do method resolution for `my_list.append`; you can always simply do `append = my_list.append` to get the same benefit. There are other minor things, but all of these are *marginal* issues. In any case, as I said, *list comprehensions already use `.append`*. There's just a specialized op-code for it. If you are worried about the overhead of method resolution, then you should also be worried about the overhead of calling your predicate functions in the `filter`/`map` and `itertools` constructs – juanpa.arrivillaga Nov 26 '19 at 22:03
  • Note, the above is why I essentially never use `filter` or `map` anymore, and simply stick to list-comprehensions. Indeed, I recall that Guido wanted to remove `filter` and `map` from Python 3. – juanpa.arrivillaga Nov 26 '19 at 22:07
  • Ok. Point well made. I'll have to think whether I have any other questions. At this point, I can't even say why I thought so firmly that list comprehensions would be vastly faster. I guess it's a common misconception. Do you want to put your thoughts in an answer so I can accept it? – Bill Huneke Nov 26 '19 at 22:35

1 Answer


I had a similar problem some time ago. I found an answer somewhere on here and used it in my code afterwards. I cannot find the original question; if someone does find the link, please comment it below so that I can integrate it into this answer.

There are two ways to do it:

# Let p1 be a function that checks whether x has the property to belong to lst1
# Let s be the list/iterator you want to iterate through

# 1st way - one loop

lst1, lst2 = [], []
for x in s:
    target = lst1 if p1(x) else lst2
    target.append(x)

# 2nd way - one list comprehension (Not recommended)

lst1, lst2 = [[x for x in cur_list if x is not None]
              for cur_list in zip(*[(x, None) if p1(x) else (None, x)
                                    for x in s])]

Now I think that you are looking for speed, thus let's check which one is faster with a toy example (you can check it with your actual code anyway):

import timeit

# Note: the code strings must not be indented, or exec() inside
# timeit would raise an IndentationError.
code1 = """
lst1, lst2 = [], []
for x in range(1_000_000):
    target = lst1 if x % 3 else lst2
    target.append(x)
"""
elapsed_time1 = timeit.timeit(code1, number=100) / 100
print(elapsed_time1)

code2 = """
lst1, lst2 = [[x for x in cur_list if x is not None]
              for cur_list in zip(*[(x, None) if x % 3 else (None, x)
                                    for x in range(1_000_000)])]
"""
elapsed_time2 = timeit.timeit(code2, number=100) / 100
print(elapsed_time2)

Results

0.307000948000001
0.779959973

This shows that the single loop using the `.append` method is faster, as claimed in the comments.
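As an aside, the same single-pass append pattern generalizes to more than two output lists; a sketch using `collections.defaultdict` and a made-up key function:

```python
from collections import defaultdict

def partition_by(stream, key):
    """Route each item to a bucket chosen by key(item), still in one pass."""
    buckets = defaultdict(list)
    for x in stream:
        buckets[key(x)].append(x)
    return buckets

# toy usage: bucket 0..9 by remainder mod 3
buckets = partition_by(range(10), lambda n: n % 3)
# buckets[0] == [0, 3, 6, 9], buckets[1] == [1, 4, 7], buckets[2] == [2, 5, 8]
```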

FBruzzesi
  • Of course, there you are doing a single `for` vs 3 `for` statements, what do you expect? – OSainz Nov 27 '19 at 09:21
  • I expected nothing different, just stating that it is possible (yet not recommended) to do it with a single nested list comprehension. – FBruzzesi Nov 27 '19 at 09:26
  • 2
    The point is that the author wants to do it in a single pass because the input list is huge. So he expects a way to do it in a single pass using a list comprehension. – OSainz Nov 27 '19 at 09:38