17

I have a chain of for-loops that works on an original list of strings and then gradually filters the list as it goes down the chain, e.g.:

import re

# Regex to check that a capital letter exists in the string.
pattern1 = re.compile(r'\d.*?[A-Z].*?[a-z]')
vocab = ['dog', 'lazy', 'the', 'fly'] # Imagine it's a longer list.

def check_no_caps(s):
    return None if re.match(pattern1, s) else s

def check_nomorethan_five(s):
    return s if len(s) <= 5 else None

def check_in_vocab_plus_x(s,x):
    # s and x are both str.
    return None if s not in vocab else s+x

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
# filter with check_no_caps
slist = [check_no_caps(s) for s in slist]
# filter no more than 5.
slist = [check_nomorethan_five(s) for s in slist if s is not None]
# filter in vocab
slist = [check_in_vocab_plus_x(s, str(i)) for i,s in enumerate(slist) if s is not None]

The above is just an example; in reality my functions that manipulate the strings are more complicated, but they do return either the original string or a manipulated one.

I could use generators instead of lists and do something like this:

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
# filter with check_no_caps and no more than 5.
slist = (check_nomorethan_five(s2) for s1 in slist
         for s2 in [check_no_caps(s1)] if s2 is not None)
# filter in vocab
slist = [check_in_vocab_plus_x(s, str(i)) for i,s in enumerate(slist) if s is not None]

Or in one crazy nested generator:

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
slist = (check_in_vocab_plus_x(s3, str(i)) for i, s1 in enumerate(slist)
         for s2 in [check_no_caps(s1)] if s2 is not None
         for s3 in [check_nomorethan_five(s2)] if s3 is not None)

There must be a better way. Is there a way to make the chain of for-loops faster?

Is there a way to do it with map, reduce and filter? Will it be faster?

Imagine that my original slist is very, very large, like tens of billions of strings. And my functions are not as simple as the ones above; they do some computation and handle around 1,000 calls per second.

alvas
  • First of all, make `vocab` a set instead of a list. And, maybe I'm oversimplifying this, but what's wrong with using generator expressions instead of list comprehensions? It's not like you have to use one giant unreadable generator. – Aran-Fey Jul 17 '16 at 17:48
  • As I see it, you have the two options (1) do it all at once with list comprehensions _or_ (2) do it one at a time with generators. Even if `map` or `reduce` or `filter` is faster than a generator expression, that's going to be insignificant if your functions are going to do lots of computation. Aren't you optimizing the wrong thing here? If one element at a time isn't fast enough, shouldn't you be optimizing your functions instead of the loops? – Aran-Fey Jul 17 '16 at 18:02
  • Sadly, the functions are fixed by someone else, and a second person wrote this chain of for-loops. When it came to deploying the code on the real dataset, I found that it's extremely slow, so I wrote the generator trick; I think it helps because I'm not instantiating the list multiple times, but it's still slow, and I thought `map/reduce/filter` might get me some speedups. – alvas Jul 17 '16 at 18:09
  • I believe you will need to refactor the code a bit; I could not find a way to do it in this situation. – Or Duan Jul 17 '16 at 19:50
  • When you do `enumerate` at the end, should it enumerate w.r.t the original list, or the previous list before that step, including the None elements, or the previous list with Nones already filtered out? Right now, you do the second option, which of the three IMHO makes the least sense. – tobias_k Jul 20 '16 at 15:32
  • Also, while you said that those `check` functions are fixed, it would really be better to split them up into `check` functions, which only perform a check and return true or false, and `modify` functions, which always return a modified non-None result. It would then be much easier to use a chain of filter and map statements. – tobias_k Jul 20 '16 at 15:34
  • Have you timed it? Is the time being spent in the calls significant compared to the filtering functions themselves? If not, then you are wasting your time. You can put your outer stuff in a function, as things inside a function are faster than code at the top level in Python. BTW there is another option of using a `compose` function. But if your bottleneck is in the functions then I doubt it would help much. – Aseem Bansal Jul 20 '16 at 17:18
  • This is exactly the kind of code that gets 100X speed up just using PyPy. Give it a shot! – alec_djinn Jul 27 '16 at 08:47

6 Answers

8

First, consider the overall process you apply to your strings: you take some strings and apply certain functions to each of them, then clean up the list. Let's assume for a moment that every function you apply runs in constant time (that's not true, but for now it won't matter). In your solution you iterate through the list applying one function (that's O(N)), then you take the next function and iterate again (another O(N)), and so on. So the obvious way to speed things up is to reduce the number of loops. That's not so difficult.

The next thing to do is to try to optimize your functions. E.g. you use a regexp to check whether a string has capital letters, but there is str.islower (which returns true if all cased characters in the string are lowercase and there is at least one cased character, false otherwise).

So, this is the very first attempt to simplify and speed-up your code:

vocab = ['dog', 'lazy', 'the', 'fly'] # Imagine it's a longer list.

# note that the first two functions can be combined into one
def no_caps_and_length(s):
    return s if s.islower() and len(s) <= 5 else None

# this one is more complicated and cannot be merged with the first two
# (not really, but as you say, some functions are rather complicated)
def check_in_vocab_plus_x(s, x=''):  # default x so it can be piped with a single argument
    # s and x are both str.
    return None if s not in vocab else s + x

# now let's introduce a function that pipes a string through all the functions you need
def pipe_through_funcs(s):
    # yeah, here we have only two, but there could be more
    funcs = [no_caps_and_length, check_in_vocab_plus_x]
    for func in funcs:
        if s is None:
            return s
        s = func(s)
    return s

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
# final step (wrapped in list() so it is also evaluated on Python 3):
slist = list(filter(lambda s: s is not None, map(pipe_through_funcs, slist)))
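For reference, with the sample slist this pipeline should leave only the short lowercase words that are also in vocab (the index suffix is skipped here because of the default x argument):

print(slist)  # expected: ['the', 'dog', 'the', 'fly']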

There might be one more thing that can be improved. Currently you iterate through the list modifying elements and then filter it. But it might be faster to filter first and then modify. Like this:

vocab = ['dog', 'lazy', 'the', 'fly'] # Imagine it's a longer list.

# make a function that does all the checks for filtering
# you can make a big expression and return its result,
# or a sequence of ifs, or anything in-between,
# it won't affect performance,
# but make sure you put cheaper checks first
def my_filter(s):
    if len(s)>5: return False
    if not s.islower(): return False
    if s not in vocab: return False
    # maybe more checks here
    return True

# now we need a modifying function
# there is a concern: if you need indices as they were in the original list,
# you might need to think of some way to pass them here,
# as you iterate through the filtered list
def modify(s, x=''):  # default x so map() can call it with a single argument
    s += x
    # maybe more actions
    return s

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
# final step (wrapped in list() so it is also evaluated on Python 3):
slist = list(map(modify, filter(my_filter, slist)))

Note also that in some cases generators, map and the like can be faster, but that is not always true. I believe that if the number of items you filter out is substantial, it might be faster to use a for-loop with append. I would not vouch that it will be faster, but you could just try something like this:

initial_list = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
new_list = []
for s in initial_list:
    processed = pipe_through_funcs(s)
    if processed is not None: new_list.append(processed)
Alissa
  • Actually `O(cn) = O(n)` when speaking about complexity, as long as `c` is a relatively small value. Also, in your solution you have an inner loop with a cost of `O(c)`, which multiplied by the outer loop is still `O(cn)`: there is no optimization from the performance point of view, although it is much more elegant code. – Kjir Jul 21 '16 at 10:13
  • Oh, I hadn't thought about the fact that the constant remains. As for the coefficient itself, I'd say that even if you have N=1000000, you will still notice the difference between O(N) and O(2N). As we do not know how big the dataset is, we cannot tell how much the coefficient will weigh in overall performance. – Alissa Jul 21 '16 at 13:42
  • Yes, but compared to the size of `n` it would be meaningless and optimizing that wouldn't really make much difference... – Kjir Jul 22 '16 at 17:25
  • @Kjir If we are not talking about abstraction (and we are not), it would depend on actual size of n, don't you agree? – Alissa Jul 25 '16 at 08:48
  • As I said in my first comment, the importance of `c` is relative, meaning that it is relative to its size compared to `n`. Furthermore in this case `O(n)` is just the complexity of the outer loops, but it actually depends on the functions being called: say some of them run in `O(n^2)`, your `c` becomes really meaningless! – Kjir Jul 27 '16 at 12:40
3

If you give your transform functions a unified signature, then you could do something like this:

import random

slist = []
for i in range(0, 100):
    slist.append(random.randint(0, 1000))

# Unified functions which all have the same signature
# x is the value
# i is the counter from enumerate
def add(x, i):
    return x + 2

def replace(x, i):
    return int(str(x).replace('2', str(i)))

# Specify your pipeline as a list of tuples
# where each tuple is (filter function, transformer function)
_pipeline = [
    (lambda s: True, add),
    (lambda s: s % 2 == 0, replace),
]

# Execute your pipeline
for _filter, _fn in _pipeline:
    # bind _fn at definition time via a default argument (otherwise the lazy
    # maps on Python 3 would all see the last _fn), and unpack enumerate's
    # (i, x) pairs in the (x, i) order the functions expect
    slist = map(lambda item, _fn=_fn: _fn(item[1], item[0]),
                enumerate(filter(_filter, slist)))

The code works on both Python 2 and Python 3. The difference is that in Python 3 map and filter return lazy iterators, so nothing is executed until the result is consumed. So effectively you're going to have one pass over your list.

print(slist)
<map object at 0x7f92b8315fd0>
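To actually execute the pipeline on Python 3, you have to consume the iterator, e.g.:

# materialize the lazy map/filter chain into a concrete list
result = list(slist)
print(result)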

However, iterating once or many times won't make much of a difference as long as everything fits in memory, because regardless of the iteration method the same amount of transformation and filtering has to be executed. So to improve your code, try to make your filter and transform functions as fast as possible.

For example, what @Rawing mentioned about keeping vocab as a set instead of a list will make a big difference, especially with a large number of items.
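As a rough sketch of that suggestion (vocab taken from the question), converting the list to a set once makes every membership test O(1) on average instead of O(len(vocab)):

vocab = ['dog', 'lazy', 'the', 'fly']  # imagine a much longer list
vocab_set = set(vocab)                 # build once, reuse for every lookup

def check_in_vocab_plus_x(s, x):
    # same behaviour as the question's function, but with fast membership tests
    return None if s not in vocab_set else s + x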

Károly Nagy
3

You have a bunch of checks from which you can make an iterable:

def check1(s):
    if s.islower():
        return s

def check2(s):
    if len(s) < 5:
        return s

checks = [check1, check2]

And an iterable of strings:

l = ['dog', 'Cat', 'house', 'foo']

So one question is whether you should iterate over the checks first or the strings first.

def checks_first(l, checks):
    for check in checks:
        l = filter(None, map(check, l))

    return list(l)


def strings_first(l, checks):
    res = []

    for item in l:
        for check in checks:
            item = check(item)
            if item is None:
                break
        else:
            res.append(item)

    return res

You can time these two approaches with the timeit module. Note that you may have to use a subset of the strings to get these results in a timely fashion.

import timeit

print(timeit.timeit('checks_first(l, checks)', setup='from __main__ import checks_first, checks, l', number=10))
print(timeit.timeit('strings_first(l, checks)', setup='from __main__ import strings_first, checks, l', number=10))

Which is faster depends on the ratio of the number of checks to the number of strings, hardware, etc. From the tests I've done they seem to run at roughly the same speed.

My guess is that the biggest time savings will be gained by optimizing some of the checks. A good starting point is to identify the checks that cost the most time. This can be done with a closure to wrap your check functions.

import functools
import time

def time_func(func, timer_dict):

    @functools.wraps(func)
    def inner(*args, **kwargs):
        t0 = time.time()
        res = func(*args, **kwargs)
        timer_dict[func.__name__] += time.time() - t0
        return res

    return inner

To apply this to the checks:

from collections import defaultdict

timer_dict = defaultdict(lambda: 0)
checks = [time_func(check, timer_dict) for check in checks]

Then call the function(s) that apply the checks and view timer_dict for timing information.

checks_first(l, checks)
strings_first(l, checks)

print(dict(timer_dict))

# {'check1': 0.41464924812316895, 'check2': 0.2684309482574463}

Then identify the bottlenecks in the costly checks, either by inspection or by profiling. The latter can be done by timing lines of code with the time module or by using a line profiler.
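For instance, a minimal sketch of timing one suspect block with the time module, reusing the l and check1 defined earlier in this answer:

import time

start = time.perf_counter()      # high-resolution timer (Python 3.3+)
for s in l:
    check1(s)                    # the block you suspect is slow
print('check1 over %d strings: %f seconds' % (len(l), time.perf_counter() - start))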

Optimize your algos and data structures to get rid of these bottlenecks. You can take a look at Cython for code that you need to bring to (near) C speed.

Alex
2

First off: I think your example code is not doing what you think. The result is ['the0', 'dog1', None, None, 'the4', 'fly5'], but I believe you don't want the None values.

The only reasonable answer to this is to measure your performance and identify the bottlenecks, which will probably be in your check functions and not in the external loop.

From outside the check functions, the only real optimization I see is to perform the checks that will reduce your set the most first, so that you'll have smaller loops in the following iterations and you reduce the number of checks you perform on values you'll discard anyway. Depending on your data and the number of values that get discarded by the first checks, you might see quite a jump in performance... or not!
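For illustration only, a minimal sketch of that ordering idea; cheap_selective_check and expensive_check here are hypothetical stand-ins for the real check functions:

def cheap_selective_check(s):
    # hypothetical: cheap test that discards most strings early
    return len(s) <= 5

def expensive_check(s):
    # hypothetical: costly test that now only runs on the survivors
    return s.islower()

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
# `and` short-circuits, so the expensive check is skipped for strings
# that the cheap check already discarded
survivors = [s for s in slist if cheap_selective_check(s) and expensive_check(s)]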

The only way to really know is to profile your code. You should use cProfile together with RunSnakeRun and work on your bottlenecks, otherwise you'll end up optimizing the wrong stuff.

To profile your script you can run it as follows: python -m cProfile <script-name>
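If you prefer to profile from inside the script, a minimal sketch (main() here is just a placeholder for your own entry point):

import cProfile

def main():
    pass  # run your chain of checks here

cProfile.run('main()', sort='cumulative')  # print stats sorted by cumulative time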

Kjir
2

I can see three optimizations you could possibly make. The first is that if all the words in "vocab" are at most five characters long, you don't need to check whether the words in "slist" are at most five characters long, which means you can remove that entire for-loop. The second is that if all the words in "vocab" are lowercase and your comparison is case sensitive, then you do not need to check whether a word in "slist" contains uppercase letters, which means you can remove that for-loop too.

The basic generalization for this principle is that if a word must fulfill several conditions and one condition implies another (i.e. if you need a number that is divisible by four and two, you just need to check if it's divisible by four.), you can remove the implied condition.

If "vocab" does have words longer than five letters or words with uppercase letters, you should be able to remove them from "vocab" as all words in "slist" that are uppercase or more than five letters long would be able to be removed from your checks before getting to vocab.

The last optimization is that determining which words in "slist" are in "vocab" is essentially computing their intersection. There are many relatively quick algorithms for this that do not require a nested for-loop. Here are some examples:

Efficient list intersection algorithm

Computing set intersection in linear time?

In summary, you can remove two for loops and reduce the time complexity of the "vocab"-"slist" comparison for loop.
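As a rough sketch of the set-based approach (reusing vocab and slist from the sketch above), building a set once gives average O(1) membership tests while keeping the order and duplicates of slist:

vocab_set = set(vocab)                        # one-time conversion
kept = [s for s in slist if s in vocab_set]   # fast lookups, order and duplicates preserved
common = vocab_set & set(slist)               # plain set intersection, if order does not matter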

AlgorithmsX
1

Optimization depends heavily on the specific code. Without knowing what the real string manipulations are and the nature of the data, there is little chance of an effective result. Moreover, the OP specifically describes the string manipulations as "more complicated", which reduces the share of the outer loops in the overall run time.

Two relevant and simple tips I can add to the other answers here concern using built-in function calls and generators to optimize:

  1. Built-in functions improve performance by keeping more of the work in native code. When you use them only to call a lambda or other Python-level functions, they lose most of that performance value. Use them when a task can be completed using only native builtins; the itertools, operator, and functools modules can give much help with this (see the sketch after this list).
  2. Generators mainly help with memory optimization, when you don't want to hold all the values in memory at once. If you can do everything in one iteration without them, that is better and saves the generator overhead.
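A small sketch of point 1: passing str.islower itself keeps the per-item work in native code, while wrapping it in a lambda pays a Python-level call for every element (the sample list is just for illustration):

slist = ['the', 'dog', 'Jumps', 'over']

fast = list(filter(str.islower, slist))           # predicate runs in C
slow = list(filter(lambda s: s.islower(), slist)) # same result, extra Python-level call per item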

Another thing I would have changed in the specific example is the use of a regex. In this simple case of detecting capital letters, it may be quicker to just compare characters. Regular expressions that are not built well can be dangerous to performance; I tend to avoid them unless they bring a specific benefit for more complex comparisons.

vocab = ['dog', 'lazy', 'the', 'fly'] # Imagine it's a longer list.

def check_no_caps(s):
    for char in s:
        if 'A' <= char <= 'Z':
            return None
    return s

def check_nomorethan_five(s):
    # tolerate None coming from the previous check
    return s if s is not None and len(s) <= 5 else None

def check_in_vocab_plus_x(s, x):
    # s is a str or None, x is an int index from enumerate
    return None if s is None or s not in vocab else s + str(x)

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']

# note: the result still contains None for the strings that failed a check
result = [check_in_vocab_plus_x(check_nomorethan_five(check_no_caps(string)), i) for i, string in enumerate(slist)]
amotzg