
On the surface, this might seem to be a duplicate of "find first element in a sequence that matches a predicate", but it is not.

I have a predicate function (a function of one argument) that does some processing on its argument and returns a non-None value when the processing "succeeds". I want to apply that function efficiently to a list, or to any iterable, without iterating over every element: as soon as the predicate returns a value that is not None, I want that return value back, and the predicate should not be executed on any subsequent elements.

I was hoping there was something in itertools that would do this, but the functions there all seem hardwired to return the original element that was passed to the predicate, whereas I want the value the predicate returned.

I have a solution, shown below, but it is overly heavy code-wise. I want something more elegant that does not require the firstof utility function coded there.

Note: Reading the entire file into a list of lines is actually necessary here, since I need the full contents in memory for other processing.

I'm using Python 2 here; I do not want to switch to Python 3 at this time, but I do want to avoid using syntax that is deprecated or missing in Python 3.

import re


def match_timestamp(line):
    timestamp_re = r'\d+-\d+-\d+ \d+:\d+:\d+'
    m = re.search(r'^TIMESTAMP (' + timestamp_re + ')', line)
    if m:
        return m.group(1)
    return None


def firstof(pred, items):
    """Find result from the first call to pred of items.

    Do not continue to evaluate items (short-circuiting)."""
    for item in items:
        tmp = pred(item)
        if tmp:
            return tmp
    return None


log_file = "/tmp/myfile"
with open(log_file, "r") as f:
    lines = f.readlines()
    for line in lines:
        print "line", line.rstrip()
    timestamp = firstof(match_timestamp, lines)
    print "** FOUND TIMESTAMP **", timestamp

Suppose I have /tmp/myfile contain:

some number of lines here
some number of lines here
some number of lines here
TIMESTAMP 2017-05-09 21:24:52
some number of lines here
some number of lines here
some number of lines here

Running the above program on it yields:

line some number of lines here
line some number of lines here
line some number of lines here
line TIMESTAMP 2017-05-09 21:24:52
line some number of lines here
line some number of lines here
line some number of lines here
** FOUND TIMESTAMP ** 2017-05-09 21:24:52
bgoodr
  • I can't say this is more efficient, but it is an alternative that uses an `itertools` recipe, stops on the first true occurrence, returns the timestamp. `timestamp = match_timestamp(first_true(lines, default=None, pred=match_timestamp))` – pylang May 11 '17 at 05:52
  • @pylang +1 for notifying me of the `first_true` which I found to be shown in the recipe functions in Python 3 at https://docs.python.org/3.6/library/itertools.html#itertools-recipes but is not in Python 2 recipes at https://docs.python.org/2.7/library/itertools.html but it might apply to my use in Python 2. – bgoodr May 11 '17 at 21:37
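
For reference, the `first_true` recipe mentioned in the comments above can be sketched as follows. The recipe body follows the itertools documentation; the `lazy_filter` alias, the sample `lines`, and the condensed `match_timestamp` are assumptions added here so the sketch runs on both Python 2 and Python 3.

```python
import re

try:
    # Python 2: the built-in filter is eager, so use the lazy itertools version.
    from itertools import ifilter as lazy_filter
except ImportError:
    # Python 3: the built-in filter is already lazy.
    lazy_filter = filter


def first_true(iterable, default=False, pred=None):
    """Return the first item for which pred(item) is true, else default."""
    return next(lazy_filter(pred, iterable), default)


def match_timestamp(line):
    """Condensed version of the question's predicate (same regex)."""
    m = re.search(r'^TIMESTAMP (\d+-\d+-\d+ \d+:\d+:\d+)', line)
    return m.group(1) if m else None


# Hypothetical sample data standing in for f.readlines().
lines = [
    "some number of lines here\n",
    "TIMESTAMP 2017-05-09 21:24:52\n",
    "some number of lines here\n",
]

# first_true returns the original line, not the predicate's return value,
# so the predicate has to be applied once more to extract the timestamp.
matching_line = first_true(lines, default=None, pred=match_timestamp)
timestamp = match_timestamp(matching_line) if matching_line is not None else None
```

Because the recipe hands back the matching element rather than the predicate's result, it needs a second explicit call to `match_timestamp`, which is the duplication the question is trying to avoid.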

2 Answers

from itertools import imap, ifilter

timestamp = next(line for line in imap(match_timestamp, lines) if line)
# or
timestamp = next(ifilter(None, imap(match_timestamp, lines)))

(I believe that's the way to do it in Python 2; in Python 3 you'd simply use the built-in map, which is lazy there.)

Map the function over your lines so you get a lazy iterator of the transformed values, then lazily take the first truthy value from it using next with either a generator expression or ifilter. You can choose whether to let next raise StopIteration if no value is found, or give it a second argument to use as the default return value.
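
As a minimal sketch of the pattern just described, the following uses the default argument of next so that a missing timestamp yields None instead of raising StopIteration. The `lazy_map` alias, the sample `lines`, and the condensed `match_timestamp` are assumptions made so that the sketch runs unchanged on Python 2 and Python 3.

```python
import re

try:
    # Python 2: the built-in map is eager, so use the lazy itertools version.
    from itertools import imap as lazy_map
except ImportError:
    # Python 3: the built-in map is already lazy.
    lazy_map = map


def match_timestamp(line):
    """Condensed version of the question's predicate (same regex)."""
    m = re.search(r'^TIMESTAMP (\d+-\d+-\d+ \d+:\d+:\d+)', line)
    return m.group(1) if m else None


# Hypothetical sample data standing in for f.readlines().
lines = [
    "some number of lines here\n",
    "TIMESTAMP 2017-05-09 21:24:52\n",
    "some number of lines here\n",
]

# The generator expression needs its own parentheses when next() is given
# a second argument. That argument is returned if nothing matches.
timestamp = next(
    (ts for ts in lazy_map(match_timestamp, lines) if ts is not None), None)
missing = next(
    (ts for ts in lazy_map(match_timestamp, ["no match\n"]) if ts is not None),
    None)
```

Testing `ts is not None` rather than plain truthiness matches the question's wording; for this predicate the two behave the same, since a matched timestamp is never an empty string.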

deceze
  • `filter` and `map` are now builtins in 2.7. – pylang May 11 '17 at 05:54
  • But are they lazy (generators) as well? From the documentation it didn't seem so. – deceze May 11 '17 at 06:01
  • Aha. You are correct. For generators in 2.7, you need to import from `itertools`. – pylang May 11 '17 at 06:19
  • Marking this as the answer. [Karin's answer](http://stackoverflow.com/a/43906566/257924) was a close second and also works. I'm choosing this one as it is using `itertools` in essentially a one-liner and eliminates my `firstof` function. – bgoodr May 11 '17 at 21:46
  • Also the important aspect of this is a part of the docs that I missed upon first read at [itertools.ifilter](https://docs.python.org/2.7/library/itertools.html#itertools.ifilter) which states: "If predicate is None, return the items that are true" which is utilized in the second one-liner in this answer. – bgoodr May 11 '17 at 21:48

Edited: You can create a generator and use it with next until a timestamp is found.

with open(log_file, "r") as f:
    lines = f.readlines()
    for line in lines:
        print "line", line.rstrip()
    timestamp = None
    generator = (match_timestamp(line) for line in lines)
    while timestamp is None:
        timestamp = next(generator)
    print "** FOUND TIMESTAMP **", timestamp
Karin
  • Ok. That's definitely close, but that ends up calling `match_timestamp` twice. – bgoodr May 11 '17 at 04:58
  • Actually, `match_timestamp` is called on every line. This cannot be avoided since every line must pass the conditions of the predicate. – pylang May 11 '17 at 05:56
  • This doesn't appear to be true. Try something like `def foo(x): print x` and run `generator = (foo(i) for i in range(10))`. You should get no print out. Now try `for i in range(5): next(generator)`. You should only see 0 through 4 printed. – Karin May 11 '17 at 05:59
  • Quite right. Due to `next()`, your code isn't calling the predicate on every line. Thank you. Although there are still multiple calls, since the predicate is called on every line up to the first non-None result. The point I meant to convey to the OP was that multiple calls cannot be avoided. – pylang May 11 '17 at 06:12
  • Correct. I'm not trying to avoid multiple executions of the `match_timestamp` function, just multiple explicit _expressions_ in the code that call the function when one will do. I see your edits and yes that improves things. – bgoodr May 11 '17 at 21:43
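
The laziness discussed in the comments above can be checked with a small experiment. This is only an illustrative sketch: `record` and `calls` are hypothetical names, and `record` stands in for a predicate that succeeds on a single input.

```python
calls = []


def record(x):
    """Hypothetical predicate: log each call and succeed only for x == 3."""
    calls.append(x)
    return x * 10 if x == 3 else None


# Creating the generator expression does not call record at all.
generator = (record(i) for i in range(10))
calls_before = list(calls)

# next() pulls values one at a time and stops at the first non-None result.
result = next(value for value in generator if value is not None)
calls_after = list(calls)
```

After this runs, `calls_before` is empty and `calls_after` holds only the inputs up to and including the first success, so the predicate is never evaluated on the remaining items.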