
I need to build a function that removes all digits from a string in Python; however, my question isn't about how to do it (I already know that from previous posts here).

My question is about the complexity of the two functions below: one uses filter with a lambda to remove digits, the other uses a regex. From basic tests with very long strings (~67,912,000 chars and more)
I came to the conclusion that regex is faster. However, I do not know the actual complexity, so my 'measurement' may not be correct / may be biased by my computer's specs (maybe my computer is fast and others' will be slow, or vice versa...).
Reading on the internet about the complexity of the filter function and of regex, I understood that they are both O(n).

So, which one is 'overkill' in memory/time complexity and which one isn't? Which one should you use as the better practice?

import time
import re

def dwf(word):
    start = time.time()
    # Keep only alphabetic characters (drops digits and any other non-letters)
    word = ''.join(filter(lambda x: x.isalpha(), word))
    end = time.time()
    return end - start



def dwr(word):
    start = time.time()
    # Strip runs of digits with a regex substitution
    word = re.sub(r"\d+", '', word)
    end = time.time()
    return end - start

For example, for me: with a word of ~67 million chars, the difference in time was close to 7 seconds (filter being the slower one and regex the faster one).
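
For reference, here is a minimal way to reproduce that comparison; the length and character set of the test string are my own assumptions for illustration:

import random
import string

# Build a large test string of letters and digits (size chosen only for illustration)
test_word = ''.join(random.choices(string.ascii_letters + string.digits, k=1_000_000))

print("filter:", dwf(test_word), "seconds")
print("regex: ", dwr(test_word), "seconds")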

  • If you care about speed, you should compile your regular expression (once); a sketch of this appears after the comments. – Tom Karzes Apr 18 '20 at 23:38
  • @TomKarzes What do you mean by 'compile (once)'? Or maybe I didn't quite get your answer and you're basically saying that using filter is overkill and I should use regex – Apr 18 '20 at 23:39
  • @TomKarzes - `re` keeps a cache of recent regexes. As long as you aren't using a large number of them, there is no need to precompile. – tdelaney Apr 18 '20 at 23:41
  • Even if they both have the same complexity they could still be linearly 100s of times different. Regex is written in C and does its work on just a couple of buffers. In your filter example, you need to create an object per char and call that lambda function. No competition here! – tdelaney Apr 18 '20 at 23:43
  • Remember that time complexity only shows the way a function *grows*, it doesn't say anything about what is actually faster. `O(n^2)` might only take 100ms to execute for `n = 100` while `O(1)` (constant time) might take 1 minute for any size of the input. The total running time doesn't change the complexity of the algorithm. – VLAZ Apr 18 '20 at 23:47
  • @JeItay - `re.sub` is a helper for `re.compile(...).sub(...)`. If you use the regex multiple times, you may save a bit of time by compiling the regex once. But `re.sub` also uses a cache of recently compiled regexes, so the savings are only there if you use a lot of different regexes. – tdelaney Apr 18 '20 at 23:47
  • @tdelaney At the very least, compiling them saves a hash table lookup. Perhaps not a large improvement, but it's certainly a low-hanging fruit. – Tom Karzes Apr 18 '20 at 23:50
  • @JeItay No, I'm just saying you can make a small improvement in the regex speed by compiling it and then using the compiled one directly, rather than having Python look it up in its cache each time. – Tom Karzes Apr 18 '20 at 23:51
  • @tdelaney What do you mean by 'regex is written in C'? Is Python's regex implemented in C, or does every programming language use C in order to support regex? – Apr 19 '20 at 00:13
  • Python is a great language but its object model is fairly bulky. When you do `filter(..., word)`, every character in word needs to be converted to a `str` with a single character (~50 bytes), a frame object needs to be created for the `lambda`, etc.... Python's regex engine is implemented in C, so it can do its work with the raw characters in memory. Any language can implement regex natively, but C-based implementations like CPython often drop back to C for this operation because of the performance advantage. – tdelaney Apr 19 '20 at 00:22
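
To make the 'compile once' suggestion from these comments concrete, a sketch could look like the following (the names DIGITS and dwr_compiled are illustrative, not from the question; whether it measurably helps depends on how many distinct patterns you use, per the cache discussion above):

import re

# Compile the pattern once at module level instead of passing the raw
# pattern string to re.sub on every call (which does a cache lookup each time).
DIGITS = re.compile(r"\d+")

def dwr_compiled(word):
    return DIGITS.sub('', word)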

2 Answers


Since you already know that both take O(n), I would say the next step is to look at your input cases (here, the word itself), i.e. the best and worst cases, to see whether you get the same O(n) time complexity for the approaches you are trying.

For space complexity, I see both methods as roughly O(1) in auxiliary space on average, not counting the new string each one returns.
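
For example, a quick sketch of that kind of check could time each approach on two extreme inputs, all letters versus all digits, and compare how the times grow (the helper names here are hypothetical, not from the question):

import re
import timeit

remove_filter = lambda w: ''.join(filter(str.isalpha, w))
remove_regex = lambda w: re.sub(r"\d+", '', w)

letters_only = 'a' * 100_000   # nothing to remove: every character is kept
digits_only = '1' * 100_000    # everything is removed: every character is a digit

for name, func in (("filter", remove_filter), ("regex", remove_regex)):
    for label, text in (("letters only", letters_only), ("digits only", digits_only)):
        t = timeit.timeit(lambda: func(text), number=10)
        print(f"{name} / {label}: {t:.4f} s")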

Shakeel

The different methods are all O(n), but some are much faster than others due to a smaller constant factor.

For instance, the timing tests below show that maketrans is much faster than either of the OP's methods.

Method

import re
import string

def dwf(word):
    " Filter method "
    return ''.join(filter(lambda x: x.isalpha(), word))


def dwr(word):
    " Regex method "
    return re.sub(r"\d+", '', word)

def trans(word):
    " Maketrans method "
    translation = str.maketrans("", "", string.digits)
    return word.translate(translation)
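
As a side note (my addition, not part of the original answer), trans rebuilds the translation table on every call; if the function is called repeatedly, the table could be built once at module level:

import string

# Built once; the three-argument str.maketrans maps every digit to None, i.e. deletion
DIGIT_TABLE = str.maketrans("", "", string.digits)

def trans_precomputed(word):
    " Maketrans method with a precomputed table "
    return word.translate(DIGIT_TABLE)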

Testing

Use timeit for time measurement, since it's the most reliable.

Setup

Random string upper/lower case + digits

import string
import random

N = 1000000  # Million characters
word = ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase + string.digits) for _ in range(N))

Results

import timeit

Filter

%timeit dwf(word)
 N = 1000 -> timeit: 280 us
 N = 1 M -> timeit:  241 ms

Regex

%timeit dwr(word)
N = 1000 -> timeit: 84.6 us
N = 1 M  -> timeit: 87.5 ms

Maketrans

%timeit trans(word)
N = 1000 -> timeit: 11.4 us
N = 1 M -> timeit:   3.78 ms

Conclusion

Increasing the string length from 1K to 1M characters increased the filter and regex times by a factor of ~1000 (linear).

The maketrans time increased by only ~300x, which means it is even more efficient at processing larger strings.

With N = 1000:

  Maketrans is 24x faster than filter

  Maketrans is 7x faster than regex

With N = 1M:

  Maketrans is 63x faster than filter

  Maketrans is 23x faster than regex
DarrylG