Can this regex be achieved with only string methods in Python?

Question

Say I have a string:

foo: bar:baz : moo:mar:maz

I want to count the number of times a colon appears in this string, with a non-whitespace character immediately to the left or right of it. So foo: counts for one instance, bar:baz count for two more, and moo:mar:maz count for four instances total. We count mar twice because it's on both the right and left of a colon. The lone colon : doesn't count for anything, because it's got no adjacent non-whitespace character.

The count for the above string should therefore be 7.

I can do this by regex, as in:

str = "foo: bar:baz : moo:mar:maz"
left = len(re.findall("\S:", str))
right = len(re.findall(":\S", str))
offset = left + right

But I want to do this without regex, as I'm running a script that needs to be as optimised as possible. Is there any way to do this using only string functions?

Here's one method I tried, which basically splits up the string by spaces, then examines each substring and splits that up by colons, counting the number of elements in the resulting list and adding it to the total.

spl = str.split(" ")
count=0
print(spl)
for element in spl:
    subspl = element.split(':')
    print(subspl)
    if len(subspl) > 1:
        count += len([s for s in subspl if s != ''])

This almost works, but it fails on moo:mar:maz - the [s for s in subspl if s != ''] list comprehension returns ['moo', 'mar', 'maz'], which has three elements. This should add four to the total, not three.

Is there a way to do this using only string methods, or which is faster than regexes?

EDIT: An edge case I hadn't considered was pointed out. If the string is foo::bar foo::::bar or foo: bar: I want the code to count 2 in all cases. A colon adjacent to another colon shouldn't count towards the total, so ::: and :: and ::::::: should all count for 0. I only want to record the number of times where a non-colon, non-whitespace character is immediately adjacent to a colon.

Note that "only string methods" doesn't necessarily mean it'll be faster than using regexes. — user2357112, Dec 15 '20 at 19:16
Also, what if the string is something like `':::'`? How should that be counted? — user2357112, Dec 15 '20 at 19:17
*But I want to do this without regex, as I'm running a script that needs to be as optimised as possible.* It is unlikely that such simple regexes are the bottleneck for reasonably sized Python code. — Timur Shtatland, Dec 15 '20 at 19:31
@Tarique I just did a speed test (see my answer) and it turns out string methods actually are much faster. I didn't believe it myself at first. — mapf, Dec 15 '20 at 20:54
@user2357112supportsMonica - It shouldn't count for my purposes, but good point ... my regex solution would probably get that wrong. — Lou, Dec 15 '20 at 22:09
@Lou I've found a way to improve the string method approach a little bit more and also included JeffUK's approach in the speed test. What do you mean by `':::'` should not count though? No method can account for these cases at the moment. — mapf, Dec 16 '20 at 07:59
I mean that I only want to count instances of a colon with a non-whitespace character *other than a colon* to its left or right. So `:::` should not add to the count, but `foo:::` would add 1, `foo::bar` would add 2 etc. I didn't clarify that in my initial requirements because I hadn't thought of it — Lou, Dec 16 '20 at 09:09

Booboo · Answer 1 · 2020-12-15T20:28:40.400

2

You may discover that a regex solution is faster than a non-regex solution. I believe your regex can be, however, improved by using a single regex as follows:

import re

str = "foo: bar:baz : moo:mar:maz"
matches = re.findall(r"(\S(?=:)|(?<=:)\S)", str)
print(len(matches))

By using a lookahead and lookbehind assertion, you are in essence able to support pseudo-overlapping matches without having to use the regex package from the PyPI repository, which supports true overlapping matches. But here you have no need to actually match the colon characters.

Update:

But in case you are interested:

import regex as re

str = "foo: bar:baz : moo:mar:maz"
matches = re.findall(r"(\S:|:\S)", str, overlapped=True)
print(len(matches))

edited Dec 15 '20 at 20:28

answered Dec 15 '20 at 20:23

Booboo

38,656
3
37
60

Hi, I just did a speed test of your first solution, and it turns out, using string methods actually really is faster. Never believed that myself, but for a large dataset, it would make a difference. Can you cross-check that? – mapf Dec 15 '20 at 20:51
@mapf See my comment to your answer. – Booboo Dec 15 '20 at 20:57
Thanks, I was secretly hoping someone would also propose a better solution to my sloppy regex! This looks much neater. Interesting points about the speed tests though ... – Lou Dec 15 '20 at 22:12
I've just realised I can solve the problem of not counting colons adjacent to other colons by modifying your first regex slightly: if I change it to `([^\s:](?=:)|(?<=:)[^\s:])`, it will look for characters adjacent to colons that are neither whitespace nor other colons. So far it's generated all the correct results for me. Not sure how I'd apply it to the non-regex solution, but it's still useful :) – Lou Dec 16 '20 at 12:54
The difference between your regex and mine is the difference between "I want to count the number of times a colon appears in this string, with a non-whitespace character immediately to the left or right of it." and "I want to count the number of times a colon appears in this string, with a non-whitespace character *other than a colon* immediately to the left or right of it." – Booboo Dec 16 '20 at 13:21

JeffUK · Answer 2 · 2020-12-15T19:57:58.640

1

Of course it can be done without regex, the simple approach would be like this: (Simplified but you could easily replace !=' ' with a is_non_whitespace() function)

score = 0
for i in range(0,len(teststring)):
    if teststring[i]==":":
        if i>0 and teststring[i-1]!=' ':
            score+=1
            print(i,teststring[i-1],teststring[i])
            
        print(i,len(teststring))
        if i+1<len(teststring) and teststring[i+1]!=' ':
            score+=1
            print(i,teststring[i],teststring[i+1])
            
print(score)

Based on some rough calculations this is exponentially slower than the regex version. On your example it takes about twice as long to run. On a 2500000 character long string regex is 5 times faster.

edited Dec 15 '20 at 19:57

answered Dec 15 '20 at 19:28

JeffUK

4,107
2
20
34

Thanks, that's useful to know that regex would actually be quicker! You've answered the original question and the XY problem neatly in one ... – Lou Dec 15 '20 at 22:13
Hi, I've also included your approach in my speed test. On my machine it doesn't seem too bad and is only slightly slower than the regex approach. But feel free to check yourself. – mapf Dec 15 '20 at 23:20

mapf · Accepted Answer · 2020-12-16T10:45:05.627

Here is an approach that only uses string methods. I've run speed tests to compare it to the other two methods proposed by @Booboo and @JeffUK.

import time
import re
from functools import partial


def use_string_method(string):
    working_copy = f' {string} '.replace(' : ', '   ')
    total = 0

    for colon_version in [': ', ' :']:
        total += working_copy.count(colon_version)
        working_copy = working_copy.replace(colon_version, '  ')
    total += 2*working_copy.count(':')

    return total


def use_regex(rex, string):
    matches = rex.findall(string)
    total = len(matches)

    return total


def use_for_loop(string):
    total = 0
    for i in range(0, len(string)):
        if string[i] == ":":
            if i > 0 and string[i - 1] != ' ':
                total += 1

            if i + 1 < len(string) and string[i + 1] != ' ':
                total += 1

    return total


def account_for_double_colons_using_string_method(string):
    working_copy = string
    length = len(working_copy) - 1
    while length < len(working_copy):
        length = len(working_copy)
        working_copy = working_copy.replace('::', ': :')

    return working_copy


def account_for_double_colons_using_regex(string):
    working_copy = re.sub('::', ': :', string)

    return working_copy


text = 'foo:: bar:baz : moo::mar::maz:'*10000

for name, method in {
    'string': account_for_double_colons_using_string_method,
    'regex': account_for_double_colons_using_regex,
}.items():
    start = time.time()
    method(text)
    stop = time.time()
    print(f'{name}: {stop-start} seconds')

regex = re.compile(r"(\S(?=:)|(?<=:)\S)")
for name, method in {
    'string': use_string_method,
    'regex': partial(use_regex, regex),
    'loop': use_for_loop,
}.items():
    start = time.time()
    for _ in range(100):
        method(text)
    stop = time.time()
    print(f'{name}: {stop-start} seconds')

To my surprise I found that the version using string methods seems to be more than 10 times faster than the other two:

# for 1000 runs @ 1000 times the original string:
string: 0.13166213035583496 seconds
regex: 2.456428050994873 seconds
loop: 3.813805341720581 seconds

Note the extra spaces in the replacement strings (' '), to make them as long as the ones they replace. By accident I found this to be important; otherwise unnecessary time is wasted on shortening the text (I guess it was something to do with memory allocation).

UPDATE

Added two approaches to account for double colons, one using string methods, the other using regex. Even though string methods require a while loop, they appear to be superior again:

# for one run @ 10000 times the original string
string: 0.0009975433349609375 seconds
regex: 0.003989458084106445 seconds

Surprisingly, even in an extreme case, the string methods are still slightly faster than regex:

# for one run @ ':'*1000000
string: 0.017957448959350586 seconds
regex: 0.024881601333618164 seconds

There is an overhead in compiling the regex expression. This needs to be amortized against a real-world run, which I assume is not just "foo: bar:baz : moo:mar:maz". Otherwise, who cares how fast it runs? So my assumption is that the actual string is *large*. How large? I don't know. But what if it were megabytes? Did you first compile the regex and then time it from that point? That would be a fairer measurement. — Booboo, Dec 15 '20 at 20:54
Yeah, I have no idea. Again, I never thought this would be faster. But it seems to be quite significant and independent of string length and number of method call repetitions. — mapf, Dec 15 '20 at 20:57
Can you elaborate on the compiling part? I'm not an expert in regex, so I probably didn't do that but also don't know how. Maybe run the rest yourself and see what is says? — mapf, Dec 15 '20 at 20:59
You are timing the wrong thing. You need to separately compile the regex: `rex = re.compile(r"(\S(?=:)|(?<=:)\S)")`. Then pass `rex` to function `use_regex` as an argument and then that function executes `rex.findall(string)`. Or `rex` can be a global variable. I can update your answer with the change, if you want. — Booboo, Dec 15 '20 at 21:01
I see! Thanks. I'll test that instead. Sure, please go ahead and edit it! If you like, you can also add your other version. I couldn't run that myself though because I am missing the module. — mapf, Dec 15 '20 at 21:06
You also need to take the length of `matches` for my code to be complete. Also, there is the overhead of the function call that should be minimized by perhaps performing the computation many times for each version. — Booboo, Dec 15 '20 at 21:10
I edited the code accordingly (I think). If there is something missing / wrong, feel free to make changes. — mapf, Dec 15 '20 at 21:15
Thanks for this, it's a great answer and does answer the original question I asked. I only later realised that I had an extra requirement: that colons not be counted as the character to the left or right of another colon. In that sense, this method overcounts - but then so would my original regex too. I don't suppose you can think of a way that your method could capture non-colon characters adjacent to colons? — Lou, Dec 16 '20 at 09:23
Fantastic. Thanks for an excellent answer and for taking the time to time all of these methods! — Lou, Dec 16 '20 at 10:24
You're welcome! I thought it was an interesting problem :) a final note: I just tested that even in the extreme case of only colons (say `10000*':'`), the string method is still slightly faster than the regex. — mapf, Dec 16 '20 at 10:43

Can this regex be achieved with only string methods in Python?

3 Answers3