How to create substrings efficiently

Question

Given a string, typically a sentence, I want to extract all substrings of lengths 3, 4, 5, 6. How can I achieve this efficiently using only Python's standard library? Here is my approach, I am looking for one which is faster. To me it seems the three outer loops are inevitable either way, but maybe there is a low-level optimized solution with itertools or so.

import time

def naive(test_sentence, start, end):
    grams = []
    for word in test_sentence:
        for size in range(start, end):
            for i in range(len(word)):
                k = word[i:i+size]
                if len(k)==size:
                    grams.append(k)
    return grams

n = 10**6
start, end = 3, 7
test_sentence = "Hi this is a wonderful test sentence".split(" ")

start_time = time.time()
for _ in range(n):
    naive(test_sentence, start, end)
end_time = time.time()

print(f"{end-start} seconds for naive approach")

Output of naive():

['thi', 'his', 'this', 'won', 'ond', 'nde', 'der', 'erf', 'rfu', 'ful', 'wond', 'onde', 'nder', 'derf', 'erfu', 'rful', 'wonde', 'onder', 'nderf', 'derfu', 'erful', 'wonder', 'onderf', 'nderfu', 'derful', 'tes', 'est', 'test', 'sen', 'ent', 'nte', 'ten', 'enc', 'nce', 'sent', 'ente', 'nten', 'tenc', 'ence', 'sente', 'enten', 'ntenc', 'tence', 'senten', 'entenc', 'ntence']

Second version:

def naive2(test_sentence,start,end):
    grams = []
    for word in test_sentence:
        if len(word) >= start:
            for size in range(start,end):
                for i in range(len(word)-size+1):
                    grams.append(word[i:i+size])
    return grams

The `len(k)==size` check can be eliminated - the only way that can fail is if you start your slice at a point too close to the end of the sentence, but that could be better handled by reducing the range of the `for i` loop. Also, do you really need all of the substrings to exist at the same time, in a list? Memory usage could be *vastly* reduced by yielding them one at a time in a generator function. — jasonharper, Dec 09 '21 at 20:34
Memory is not a problem for me, time is. Hmm okay thinking about the boundaries .. — Marcel Braasch, Dec 09 '21 at 20:36
Woah, I eliminated the length check and moved it to word level and looked for right boundaries. It's twice as fast. Changing the code. — Marcel Braasch, Dec 09 '21 at 20:48
Um, `{end-start} seconds` is not right. Could you fix that and also show your times for the two solutions? — Kelly Bundy, Dec 10 '21 at 21:27

score 4 · Accepted Answer · answered Dec 09 '21 at 22:22

4

Well, I think this is not possible to improve the algorithm, but you can micro-optimize the function:

def naive3(test_sentence,start,end):
    rng = range(start,end)
    return [word[i:i+size] for word in test_sentence
                           if len(word) >= start
                           for size in rng
                           for i in range(len(word)+1-size)]

Python 3.8 introduces assignment Expressions that are quite useful for performance. Thus if you can use a recent version, then you can write:

def naive4(test_sentence,start,end):
    rng = range(start,end)
    return [word[i:i+size] for word in test_sentence 
                           if (lenWord := len(word)+1) > start
                           for size in rng
                           for i in range(lenWord-size)]

Here are performance results:

naive2: 8.28 µs ±  55 ns per call
naive3: 7.28 µs ± 124 ns per call
naive4: 6.86 µs ±  48 ns per call    (20% faster than naive2)

Note that half of the time of naive4 is spent in creating the word[i:i+size] string objects and the rest is mainly spent in the CPython interpreter (mainly due to the creation/reference-counting/deletion of variable-sized integer objects).

answered Dec 09 '21 at 22:22

Jérôme Richard

41,678
6
29
59

Very nice, I always hated that you can't declare variables in a list comprehension. I like this approach + 20% is really good for my application. – Marcel Braasch Dec 09 '21 at 23:11
@MarcelB Not sure what you mean, but sounds wrong. You "declare" variables in pretty much *every* list comprehension. For example `word`, `size` and `i` in the first one here. – Kelly Bundy Dec 19 '21 at 11:17
@KellyBundy I mean constants which you can reuse as in the example above ... – Marcel Braasch Dec 19 '21 at 14:12
@MarcelB What constants? And I'd define constants before (i.e., outside of) the list comprehension. – Kelly Bundy Dec 19 '21 at 14:15
@MarcelB If you mean `lenWord`, that's no more "declared" or "constant" than `word`, `size` or `i`. And you've always been able to set it just like those, for example with `for lenWord in [len(word)+1]`. – Kelly Bundy Dec 19 '21 at 14:28
@KellyBundy while you could indeed do that, it is not as efficient as using an assignment expression (in fact using this pattern makes the code as slow as `naive3`) because of the additional loop/branch and the creation of a useless list. This is the old unpythonic way to do that though. New assignment expressions address this and are more expressive. – Jérôme Richard Dec 19 '21 at 15:11
@JérômeRichard Then you're using an old Python version. With new ones, there's no loop and no list creation. And it was never unpythonic. – Kelly Bundy Dec 19 '21 at 15:33
Indeed, Python 3.9 (released about 1 year ago) optimizes this case and produces a similar code to `naive4` with the for pattern. I used Python 3.8. Thank you for pointing this out. Regarding the pythonic style I consider this to be more pythonic as the *intent* of the assignment expression is more clear (and readable) than the additional loop because there is no need for an actual loop: the intent is encoded in another language feature (and I think one purpose the PEP 572 was to simplify the syntax in such a case because the previous solution was not so great). However, this is opinion-based. – Jérôme Richard Dec 19 '21 at 16:09
1

Yeah, I can agree it's opinion-based. For me it also depends on the case. In this case, the whole `(lenWord := len(word)+1) > start` looks ok (except that the variable name lies about its value). In other cases, I prefer to separate setting the variable from using it. Also, one advantage of the "list" idiom, and the reason it did get optimized after all, is that it doesn't leak the variable to outside the comprehension. Btw, a micro-optimization that I think could have a larger effect is to prepare a list of `slice` objects for each word length, instead of recreating them all the time. – Kelly Bundy Dec 19 '21 at 17:54

score 0 · Answer 2 · answered Dec 09 '21 at 20:57

0

I believe this will do it:

test_sentence = "Hi this is a wonderful test sentence".split()

lengths = [3, 4, 5, 6]

result = []
for t in test_sentence:
    for l in lengths:
        if len(t) >= l:
            start = 0
            while start + l <= len(t):
                result.append(t[start:start+l])
                start += 1

answered Dec 09 '21 at 20:57

defladamouse

567
2
13

Can you please show that this is indeed faster than the OPs method? It looks similar and would have the same Big O value of n^3 – Matthew Barlowe Dec 09 '21 at 21:02
It is exactly as fast as my second approach (tried 10 runs and the values differ marginally every time). – Marcel Braasch Dec 09 '21 at 21:03
@MatthewBarlowe from theoretical perspective I'm not sure there is too much to optimize. It might be more of a library thing here. – Marcel Braasch Dec 09 '21 at 21:04
If you're really looking for performance, you're probably best of writing it in `C` and just using `Python` as a wrapper. – defladamouse Dec 10 '21 at 13:56

How to create substrings efficiently

2 Answers2