Optimization of list comprehension with string matching

Question

I have a function which looks like this:

def my_fun:
    return [match.start() for match in re.finditer(f'(?=({some_pattern}))', string)]

It just returns a list of integers, and yes I have tried a generator and it is super fast but unfortunately it has to be a list. Pattern may look like this word1|word2|word3... - many words :) It is an effect of my optimization work but it is too slow still. What can I do to tune up the performance? Should I consider using Cython?

Example string: https://www.codepile.net/pile/bD9R0qZR

Example pattern: https://www.codepile.net/pile/eRPK025Y

Perhaps compiling a regex would help a bit. Depending on actual expression there might be a faster method that doesn't use regexes. — zch, Jul 20 '19 at 21:37
How are the returned results used? A generator expression might be an option if you want to avoid upfront cost and only do work as it's needed. — Carcigenicate, Jul 20 '19 at 21:37
@zch Unfortunalety compiling a regex does not help - check this thread: https://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile — maslak, Jul 20 '19 at 21:48
It would help if you provided an example string to do some actual benchmarking. I could set up some c/c++/rust/python/regex comparisons. Just upload a file on something like share-online and provide a link. — Finomnis, Jul 20 '19 at 21:49
@Finomnis it would be great, I've provided examples in the question — maslak, Jul 20 '19 at 22:08
Have you tried just looping through `word`s with Python? Given the simplicity of the regex this may already be faster than what you are doing, and you could quite possibly speed things up with `numba`. — norok2, Jul 20 '19 at 22:15

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

I just implemented this using Python primitives and compared to your solution. The following code works also for multiple (and potentially overlapping) occurrences (using the match_all() helper function), and it is substantially faster than using regex:

import re


def find_all(text, pattern):
    len_text = len(text)
    i = 0
    while i < len_text:
        i = text.find(pattern, i)
        if i >= 0:
            yield i
            i += 1
        else:
            break


def my_fun_all(text, pattern):
    words = pattern.split('|')
    return sorted([i for word in words for i in find_all(text, word)])


def my_fun(text, pattern):
    return [match.start() for match in re.finditer(f'(?=({pattern}))', text)]


print(my_fun(text, pattern))
# [501, 523, 865, 878, 1407, 1728, 1956]
print(my_fun_all(text, pattern))
# [501, 523, 865, 878, 1407, 1728, 1956]

print(my_fun(text * 2, pattern) == my_fun_all(text * 2, pattern))
# True


%timeit my_fun(text, pattern)
# 1.48 ms ± 6.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_fun_all(text, pattern)
# 512 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit my_fun(text * 2, pattern)
# 2.97 ms ± 27.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit my_fun_all(text * 2, pattern)
# 843 µs ± 8.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit my_fun(text * 20, pattern)
# 29.4 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit my_fun_all(text * 20, pattern)
# 3.36 ms ± 31.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

EDIT

This was my solution before OP specified the need to support multiple occurrences:

def my_fun_new(text, pattern):
    words = pattern.split('|')
    result = []
    for word in words:
        i = text.find(word)
        if i >= 0:
            result.append(i)
    return sorted(result)

EDIT

Adding timings for @Finomnis implementation of Aho-Corasick:

import ahocorasick


def my_fun_ac(string, raw_pattern):
    A = ahocorasick.Automaton()
    for pattern in raw_pattern.split('|'):
        A.add_word(pattern, len(pattern))
    A.make_automaton()

    result = []
    for end_index, original_len in A.iter(string):
        result.append(end_index - original_len + 1)
    return result


print(my_fun(text * 2, pattern) == my_fun_ac(text * 2, pattern))
# True

%timeit my_fun_ac(text, pattern)
# 221 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_fun_ac(text * 2, pattern)
# 268 µs ± 972 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit my_fun_ac(text * 20, pattern)
# 1.09 ms ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

which is a good 2 to 4 factor faster than my implementation using Python builtins.

Just for completeness, I report what I have used for text and pattern:

text = """\
Intro:
[Snoop]
Da da da da daaaa
It's tha motha fuckin' D O double G (Snoop dogg)
Da da da da daaaa
You know im mobbin' with the D.R.E

[Kurupt]
Yeah yeah yeah
You know who's back up in this mothafucker *echoes*

[Snoop]
What what what what

[Kurupt]
So blaze the weed out there

[Snoop]
Blaze it up

[Kurupt]
Blaze that shit up nigga
Yeah waz up snoop

Verse one: [snoop]
Top dogg buy them all nigga burn this shit up
D-P-G-C my nigga turn that shit up
CPT, LBC yeah we hookin' back up
N' when they bang this in the club baby you gotta get up
Thug niggas, drug dealers yeah they givin' it up
Low life, your life boy we livin' it up
Take a chance that's why we dancin'
In the party fo' sho'
Slip my ho a fourty four n' she crept in it back do'
Bitches lookin' at me strange but you know i don't care
Step up in this mothafucker just to swingin' my hair
Bitch quit talkin' Crip walk
If you down with the set
Take a Bullet with some dick
Take this dope from this jet
Outta town put it down for father of rap
N' if your ass get crack bitch shut your trap
Come back get back that's the part of success
N' if you believe the X then you'll relievin' your stress

Music in between

[snoop]
Da da da da daaaa

[Dre]
It's the mothafuckin' D-R-E

[Kurupt]
Dr. Dre mothafucker [Snoop] what what what what

[Snoop]
Da da da da daaaa

Verse two: [Dre]
You know im mobbing with the D O double G
Straight off the fuckin' street's of CPT
King of the beats n' you ride to em' in your fleet (Fleetwood) wood
Or Coupe DeVille rollin on dubs
How you feel?
Whoopty whoop nigga what?
Dre n' snoop chronic'ed out
In the 'llac with doc in the back
Sippin' 'gnac, clip in the strap
Dippin' through hoods
What hoods? Compton, longbeach, ingelwood
South central out to the westside (westside)
It's california love this california bud
Got a nigga gang of pub
I'm on one, I might bail up in the Century Club
With my jeans on n' my team's strong
Get my drink on n' my smoke on
Then go home with somethin' to poke on (waz up bitch)
Loc' it's on for the two tripple oh
Comin' real it's the next episode *Echoes*

Music in between

[Nate Dogg]
Hold up.. heeeeey
For my niggas who be thinkin' we soft
We don't.. plaaaay
We gonna rockin' til the weels fall of
Hold up.. heeeeey
For my niggas who be acting to bold
Take a.. seeeeat
Hope you ready for the next episode heeeeeey
(Music stops and pauses)
Smoke weed everyday
"""

pattern = "melodic|eyes|wander|rambunctious|whirl|provide|ruddy|quaint|sea|snatch|tangy|women|mammoth|peel|home|hope|sense|measure|lake|gleaming|vagabond|phobic|support|boring|discreet|overrated|spoil|load|lick|early|envious|alleged|heady|queen|seed|quiver|squalid|jelly|standing|wreck|slope|farflung|press|run|tender|shrill|duck|actor|foregoing|tickle|haunt|minor|seal|breakable|wren|trick|bitesized|stage|listen|depend|string|abounding|explode|cows|low|creature|compare|habitual|pipe|hand|occur|eye|drunk|furniture|insect|worthless|fang|found|connection|quarter|celery|stretch|rifle|glistening|baby|flesh|invite|motionless|watch|letter|current|succinct|panicky|snail|ear|prevent|nebulous|knife|jolly|dirt|scrawny|zephyr|giraffe|unique|jumbled|curly|fry|blushing|delirious|open|muddled|year|earsplitting|zipper|grandiose|answer|puncture|chase|gentle|middle|nine|overflow|rod|troubled|pop|handy|chalk|thunder|acceptable|bumpy|filthy|majestic|paste|snotty|quack|illated|tired|present|credit|substance|launch|equable|bashful|callous|appreciate|dead|soothe|amused|handsome|nappy|amazing|unbiased|bushes|yellow|way|cakes|cent|tested|letters|colour|apparel|rhythm|shoes|homeless|open|cub|alarm|soak|voracious|chin|flame|skillful|form|possessive|retire|camera|street|condition|gate|scintillating|follow|imported|approval|humdrum|tasty|murky|sprout|morning|picture|happen|observe|bang|slap|cough|five|tie|plan|punish|sneeze|perfect|shake|average|clear|caring|accurate|telephone|skate|daughter|wild|spurious|nutritious|sneaky|breezy|separate|sore|fade|blind|aware|cheat|heat|cowardly|unused|sheet|puny|pump|property|obscene|immense|soggy|shiny|spot|racial|lace|dapper|cheap|fluttering|husky|fire|hammer|aquatic|stick|lamentable|yell|chilly|zoom|animated|living|spell|separate|shade|delicious|deer|suck|squeamish|call|labored|shirt|numerous|push|sleet|price|earthy|ambiguous|porter|chickens|mailbox|omniscient|mourn|descriptive|kiss|polite|changeable|children|cheese|assorted|illustrious|action|serious|dislike|rhetorical|scandalous|nasty|steady|safe|walk|different|example|ring|talk|print|dinosaurs|switch|behave|murder|brown|cooperative|past|proud|disastrous|observant"

(EDIT: Fixed input offset, updated code and timings)

I am not sure if your approach handles multiple and overlapping matches. — maslak, Jul 20 '19 at 22:38
@maslak check out the edit, but I think now handles multiple matches correctly. Not sure what you mean by overlapping matches, but I guess if one word from the word list is part of another word, I guess both mine and the regex approaches perform equally, e.g. `my_fun('ciao pippo ao', 'ciao|ao') == my_fun3('ciao pippo ao', 'ciao|ao') == [0, 2, 11]`. — norok2, Jul 20 '19 at 23:16
Just a minor notice: Your results are all off by one. You added a newline to the beginning of your test string. You can fix that by changing it to ``text = """\``. — Finomnis, Jul 20 '19 at 23:40
@Finomnis I realized that I may not have been using EXACTLY the same input, but did not feel like going through everything again just to fix this... afterall, it's just a different input and OP's code gives identical results as my solution. — norok2, Jul 20 '19 at 23:45
It's fine, it just confused me that I didn't get the same numbers as you did. — Finomnis, Jul 21 '19 at 00:00

Finomnis · Accepted Answer · 2019-07-21T01:12:15.200

As I am quite a rust fanboy, I implemented your problem in rust, once with a simple loop and the built-in match_indices function, and once using the aho-corasick crate. Theoretically, aho-corasick should scale best, in case your real usecase is a lot larger than the example.

Then, after realizing how good the aho-corasick version is, I searched for a direct aho-corasick library for python, and found pyahocorasick.

Python version:

def my_fun(string, some_pattern):
    return [match.start() for match in re.finditer(f'(?=({some_pattern}))', string)]

Python version with pyahocorasick:

def my_fun(string, raw_pattern):
    A = ahocorasick.Automaton()
    for pattern in raw_pattern.split('|'):
        A.add_word(pattern, len(pattern))
    A.make_automaton()

    result = []
    for end_index, original_len in A.iter(string):
        result.append(end_index - original_len + 1)
    return result

Rust version, with match_indices:

use pyo3::prelude::*;

#[pymodule]
fn __lib(_py: Python, m: &PyModule) -> PyResult<()> {

    #[pyfn(m, "my_fun")]
    fn my_fun(_py: Python, string: &str, raw_pattern: &str) -> PyResult<Vec<usize>> {
        let mut results = vec![];

        for pattern in raw_pattern.split('|'){
            results.extend(string.match_indices(pattern).map(|(pos, _content)| pos));
        }
        results.sort();
        results.dedup();

        Ok(results)
    }

    Ok(())
}

Rust version, with aho-corasick 0.7.4:

use pyo3::prelude::*;

use aho_corasick::AhoCorasick;

#[pymodule]
fn __lib(_py: Python, m: &PyModule) -> PyResult<()> {

    #[pyfn(m, "my_fun")]
    fn my_fun(_py: Python, string: &str, raw_pattern: &str) -> PyResult<Vec<usize>> {
        let ac = AhoCorasick::new(raw_pattern.split('|'));

        Ok(ac.find_iter(string)
             .map(|mat| mat.start())
             .collect())
    }

    Ok(())
}

Results:

Python:     1.484 ms
Python AC:  0.214 ms
Rust:       0.506 ms
Rust AC:    0.199 ms

Results for string * 10:

Python:    14.834 ms
Python AC:  0.671 ms
Rust:       2.936 ms
Rust AC:    0.360 ms

Results for string * 100:

Python:    146.940 ms
Python AC:   5.071 ms
Rust:       20.243 ms
Rust AC:     1.758 ms

All tests were measured with timeit for 1000 iterations and displayed in time per iteration.

The rust versions were implemented based on my rust submodule template.

For sake of correctness, those are the outputs of the algorithms:

Python:    [500, 522, 864, 877, 1406, 1727, 1955]
Python AC: [500, 522, 864, 877, 1406, 1727, 1955]
Rust:      [500, 522, 864, 877, 1406, 1727, 1955]
Rust AC:   [500, 522, 864, 877, 1406, 1727, 1955]

Impressive timings with Aho-Corasick. I'll add the timings for easier comparison with my answer. — norok2, Jul 21 '19 at 01:41
The most important part is that it scales well, meaning that it should get a *lot* faster once you have more patterns and a larger string, which can already be seen in my `string*100` results, where it beats the original implementation by a factor of almost 100. — Finomnis, Jul 21 '19 at 02:00

Optimization of list comprehension with string matching

2 Answers2

EDIT

EDIT