5

In the texts that I have, I want to replace each of the following special characters with a single space:

symbols = ["`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+", "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"]

What is the most efficient way (in terms of time of code execution) to do this?

For example, I want this:

(Hello World)] *!

to become this:

Hello World

The candidate methods seem to be the following:

  1. list comprehension
  2. .replace()
  3. .translate()
  4. regular expressions
Outcast
  • 1
    Please clarify. Do you want to replace each character with a space? Or do you want to remove each character entirely, replacing it with nothing? Because `(Hello World)] *!` does not become `Hello World` when you replace all of its special characters with spaces. It becomes `[one space]Hello World[five spaces]`. – Kevin May 30 '19 at 13:07
  • @Kevin, can you please do both or at least the latter? – Outcast May 30 '19 at 13:08
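To make the distinction raised in the comments above concrete, here is a minimal sketch of both behaviours using str.translate(), plus a third variant that collapses the leftover whitespace to get exactly `Hello World`. The names `to_space` and `to_nothing` and the whitespace-collapsing step are illustrative assumptions, not part of the question.

```python
symbols = "`~!@#$%^&*()_-+={[]}|\\:;\"<,>.?/"

# Table that maps every symbol to a single space.
to_space = str.maketrans(dict.fromkeys(symbols, " "))
# Table that deletes every symbol (third argument of maketrans).
to_nothing = str.maketrans("", "", symbols)

text = "(Hello World)] *!"

replaced = text.translate(to_space)      # ' Hello World     '
removed = text.translate(to_nothing)     # 'Hello World '
# Replace, then collapse runs of whitespace and trim both ends.
normalized = " ".join(replaced.split())  # 'Hello World'

print(repr(replaced))
print(repr(removed))
print(repr(normalized))
```

Only the last variant reproduces the expected output from the question exactly.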

6 Answers

8

For an efficient solution you could use str.maketrans. Note that once the translation table is defined, it's only a matter of mapping the characters in the string. Here's how you could do so:

symbols = ["`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+",
           "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"]

Start by creating a dictionary from the symbols using dict.fromkeys setting a single space as value for each entry and create a translation table from the dictionary:

d = dict.fromkeys(''.join(symbols), ' ')
# {'`': ' ', ',': ' ', '~': ' ', '!': ' ', '@': ' '...
t = str.maketrans(d)

Then call the string's translate method to map each character in the above dictionary to a single space:

s = '~this@is!a^test@'
s.translate(t)
# ' this is a test '
yatu
  • Thanks for the answer (upvote). Can you please explain a bit the `"\\"` and the `"\""`? It is quite unclear to me what these remove. Apparently, the way they are written also has to do with the fact that the backslash is used for escaping. – Outcast May 30 '19 at 15:12
  • They will be removing the characters \ and " respectively, however they need to be escaped as they have a special meaning. This is done by prepending a backslash @PoeteMaudit – yatu May 30 '19 at 15:25
  • You're welcome. Don't forget you can accept if it solved it for you :) @PoeteMaudit – yatu May 30 '19 at 15:27
  • Sure it solved it for me but some guys below had done a complete comparison so not sure which solution to accept :D – Outcast May 30 '19 at 15:35
  • All the variants of str.translate() will have roughly the same performance and will be much faster than other alternatives. You should accept this answer if you want to replace symbols with spaces (or mine if you want to remove them completely :). – Alain T. May 30 '19 at 15:40
  • Sure @PoeteMaudit :) Do note though that the purpose of accepting an answer is providing a useful reference for future visitors that might be seeking for a similar solution. So it is always advisable to accept the optimal answer for a given problem (although of course, timings are always nice to have) – yatu May 30 '19 at 16:06
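As a small illustration of the escaping discussed in the comments above: each of the two literals is a single character once Python has parsed it. The snippet below is only a sketch for checking that at the interpreter; the variable names are made up for the example.

```python
backslash = "\\"   # escaped backslash: one character
quote = "\""       # escaped double quote: one character

print(len(backslash), len(quote))   # 1 1
print(backslash, quote)             # \ "

# Single quotes avoid having to escape the double quote at all.
assert quote == '"'
# Code points: 92 for the backslash, 34 for the double quote.
assert (ord(backslash), ord(quote)) == (92, 34)

table = str.maketrans({backslash: " ", quote: " "})
cleaned = 'a\\b"c'.translate(table)
print(cleaned)   # a b c
```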
5

After running some tests, I can say that str.translate() is the fastest variant.

Input data:

symbols = {"`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+", "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"}
translate_table = {126: None, 93: None, 91: None, 125: None, 92: None, 42: None, 45: None, 94: None, 62: None, 47: None, 35: None, 59: None, 44: None, 58: None, 60: None, 124: None, 61: None, 36: None, 95: None, 43: None, 96: None, 123: None, 64: None, 33: None, 38: None, 63: None, 46: None, 34: None, 41: None, 37: None, 40: None}
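# Raw string, so that \\ reaches the regex engine as an escaped backslash
# (in a normal string it collapses to \: and the backslash is never matched).
regular_expression = r"[`~!@#$%^&*()_\-+={[\]}|\\:;\"<,>.?/]"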
small_document = "Some**r@an]]\"dom t##xt"
normal_document = "TbsX^Kt$FZ%haZe+sLxu:Al\"xNAL\\Kix[mHp_gn]PrG`DqGd~GdNc;BoEq.SYD?Rp>ukq,UfO<XdTc=RUH}oifc&oP!CB*me@Qv{Qf-Li)gmXL/IQH#mne(Khaj|"
big_document = "QOfY+dymyoGBAxTAoIeM+jEWlaECUZEUXuMvprJOqFtQR*OiHtTFZkUNbYipSTTDPOVkIdGTcjWrQmbmthKBHBSEOZ)lQAIJOrVgmGGFdtqbuFfj<Dls<JWtKczAFMPYMemiJBJHdPeeul\\x>lGIBvUsxBokagvVovrrdxdKMtAKx>MEexYv>DGqPUXYaBQKwiSIUobrPQYjilhHMQunE;RiqOZPTnyOEgRrpxcuobvvmGkFpTqgMxYYhrmRRnauiqgvCmZ\"UauceaXsgAMSakxewzPrlIrYkVCVZaEGh]qiizYyzbkcHPF@qQsQMfHPDEbEnWtrCFoARUYAloOcctqmL@hegZbfhsHaJOxOxzQhZAVjVDgokosATfhKMT!WYyPWKcKAHKCzQGGJOCglYGZbftsuyntXZUKNqgGlsLJqgN,pUcOoA/tStXFXgpoSErgvw/OUMPWjJwt=bhMAIDayOZXJm=ifYYUuAvSIZjwnBfktNvEvZmvQso%HiNZEVqoDR%nQBtCkhjSfVfDuRSRsvp-sCunjDDUYSEVLICQdisxhEfqkUTkiPlLiUNNwrvO#WTDmweZyMeIbgNXkIsvaJeHYXV(HvRcGNZM(PPRIAyyLWivGiqMVBtwObqLfEEISyyjGNEdUU:ys`dXcVawkIEAjFXky`RUXNTm`LDM}mwTOcmsSo}haJXPnkwOhKLYwve}SWifzKq}grw}fMSQXXWguUQtlWpPZQymR^wBKEyolFlZnzEEmehSNenOqDOHWRit[Npm?R?DIPXAmQYYBbmJofxUzzWBsVCoPI?VmpXhoMxCfXyHEHowXzIJvExThiffLhBTtma_jk_NrbkPCGGypXvOuBqBxDYfC{bwIHoaqnJSKytxwWXBNnKG~PKuQklGblEwH~rJoGpKZmm~tTEFnPLdmzfrqJibMYIykzL$RZLPmsZjB$AAbZwFnByOydEOIfFvTaEQaSjbpeBZuUGY&ZfPQgLihmPYrhZxSwMzLrNF.WjFiDCLyXksdkLeMHVCfrdgCAotElQ|"
no_match_document = "XOtasggWqhtSLJpHEGoCmMRepFBlRfAGKTLPcEtKonFVsPgvWgAbvJVeMWILPgLapwAmTgXWVbxOJtUFmMygzIqYPqyAxzwElTFyYcGdtnNa"

Code:

import re


def func1(doc):
    for c in symbols:
        doc = doc.replace(c, "")
    return doc


def func2(doc):
    return doc.translate(translate_table)


def func3(doc):
    return re.sub(regular_expression, "", doc)


def func4(doc):
    return "".join(c for c in doc if c not in symbols)

Test results:

func1(small_document):      0.701037002
func1(normal_document):     1.1260866900000002
func1(big_document):        3.4234831459999997
func1(no_match_document):   0.7740780450000004

func2(small_document):      0.14135037500000003
func2(normal_document):     0.5368806810000004
func2(big_document):        0.8128472860000002
func2(no_match_document):   0.394245089

func3(small_document):      0.3157141610000007
func3(normal_document):     0.927359323000001
func3(big_document):        1.9310377590000005
func3(no_match_document):   0.18656399199999996

func4(small_document):      0.3034549070000008
func4(normal_document):     1.3695875739999988
func4(big_document):        10.115730064
func4(no_match_document):   1.2086623230000022

UPD.

The input data I've provided has been "prepared" specially to test each method in isolation.

To generate translate_table I've used the following dict comprehension:

translate_table = {ord(s): None for s in symbols}

Here is a link to a website for regex validation (it could be helpful).
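Since hand-escaping a character class is error-prone, one possible way to avoid it is to let re.escape() build the pattern from the symbols. This is a sketch rather than the pattern used for the timings above; `pattern` and `strip_symbols` are hypothetical names.

```python
import re

symbols = "`~!@#$%^&*()_-+={[]}|\\:;\"<,>.?/"

# re.escape() backslash-escapes every character that is special in a
# regular expression, so the class matches each symbol literally.
pattern = re.compile("[" + re.escape(symbols) + "]")


def strip_symbols(doc):
    """Remove every character listed in `symbols` from `doc`."""
    return pattern.sub("", doc)


small_document = "Some**r@an]]\"dom t##xt"
print(strip_symbols(small_document))   # Somerandom txt
```

Compiling the pattern once also keeps the per-call cost down to the substitution itself.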


In case you want to rerun the tests yourself, here is the code:

    if __name__ == '__main__':
        import timeit

        functions = ("func1", "func2", "func3", "func4")
        documents = ("small_document", "normal_document", "big_document", "no_match_document")

        # Time every function against every document, 100000 calls each.
        for func_name in functions:
            for doc_name in documents:
                stmt = "{}({})".format(func_name, doc_name)
                setup = "from __main__ import {}, {}".format(func_name, doc_name)
                print(stmt + ": ", timeit.timeit(stmt, setup=setup, number=100000))
            print()
Olvin Roght
  • Hey thanks I really wanted to see this complete comparison (upvote). However, it is not clear from your code what the values of your variables are, e.g. `translate_table` etc – Outcast May 30 '19 at 13:22
  • what is the regular expression that you are using? That will determine the complexity, ie time. eg use `re.sub('(?i)[^a-z ]+','',doc)` – Onyambu May 30 '19 at 13:24
  • @PoeteMaudit, I've added input data in post. – Olvin Roght May 30 '19 at 13:24
  • also with your translation table, you already have the values in ASCII. Use the original values instead of ASCII for comparison – Onyambu May 30 '19 at 13:25
  • @Onyambu to be honest I have not used any yet because it is difficult to come up with one without errors - this is the problem with regular expression; quite hard to learn and write. You could have a look here: https://stackoverflow.com/questions/56376461/remove-a-big-list-of-of-special-characters – Outcast May 30 '19 at 13:26
  • @Onyambu, I've done it to minimize time consumption in process of test. Of course, it's not the way you should use in real code. – Olvin Roght May 30 '19 at 13:26
  • Ok @OlvinRoght so I assume that the time comparison is accurate then. – Outcast May 30 '19 at 13:27
  • That simply means you are giving it a head start. So it will have greater performance. If that is the case, you can then use `ur'\p{P}+'` for the regex to remove all punctuation. I mean your comparison isn't fair – Onyambu May 30 '19 at 13:29
  • @Onyambu, I've removed the "head start" to calculate how much it will take for a particular method to do replacements. Anyway, I've added `tt = {ord(s): None for s in symbols}` and it's still ~50% faster than re. – Olvin Roght May 30 '19 at 13:35
  • Although keep in mind that your times seem quite different from the times of @vurmux above (unless I am missing something). (Please do not misunderstand me; It looks good but I just want to be sure about it) – Outcast May 30 '19 at 13:36
  • @PoeteMaudit, I've tested it on a Mac mini, he probably has a more powerful PC :D – Olvin Roght May 30 '19 at 13:37
  • @PoeteMaudit, I've added code for test, you can test everything by yourself ;) – Olvin Roght May 30 '19 at 13:41
  • @PoeteMaudit, I've added tests for different string lengths, you can check. – Olvin Roght May 30 '19 at 14:10
  • This seems to really depend on input and translation table. I've just tried for us (input is some test logs up to ~60 Mb in size and translation is escaping HTML symbols + remove ~4 "bad" symbols, so ~10 replaces in total) and `translate` was actually ~2 times slower than `replace`. – The Godfather Feb 03 '22 at 11:29
  • @TheGodfather, indeed. The expected application of `.translate()` is quite different from replacing "bad" symbols. It could be used for replacing, but as it requires iterating over translation table it might be slower than `.replace()` in some cases. – Olvin Roght Feb 03 '22 at 11:38
  • @OlvinRoght I've added my own answer with my findings. This was counter-intuitive, because I expect `replace` to iterate many times over the whole string and `translate` to iterate over the string just once, doing lookups in the translation table instead... while `replace` is `O(K*N)` (where `K` is the length of the translation table and `N` is the length of the string) and `translate` is `O(N)` (given that lookup in the translation table is `O(1)`), so I would really expect translate to be faster. – The Godfather Feb 03 '22 at 11:54
1
import timeit

# Raw string so that the \\ below reaches the timed code unchanged.
s = r'''
def translate_():
    symbols = '`,~,!,@,#,$,%,^,&,*,(,),_,-,+,=,{,[,],},|,\\,:,;,",<,,,>,.,?,/'
    s = '~this@is!a^test @'
    t = str.maketrans(dict.fromkeys(symbols, ' '))
    # Return the translated string, not the untouched input.
    return s.translate(t)

def replace_():
    symbols = '`,~,!,@,#,$,%,^,&,*,(,),_,-,+,=,{,[,],},|,\\,:,;,",<,,,>,.,?,/'
    s = '~this@is!a^test @'
    for symbol in symbols:
        s = s.replace(symbol, ' ')
    return s
'''

print(timeit.timeit('replace_()', setup=s, number=100000))
print(timeit.timeit('translate_()', setup=s, number=100000))

Will print:

0.7663131961598992

0.4139239452779293

So replacing with translate is nearly 2 times faster than using several replaces.

Community
vurmux
  • Upvote for showing `timeit` results. But note that the results vary with the length of the string and how many chars that need to be replaced. – Ralf May 30 '19 at 13:49
1

My code replaces symbols with spaces and does NOT remove those spaces.

For short strings .join() is fast, but for larger strings .translate() is faster if there is a lot to replace. Surprisingly, .replace() is still very fast if there are few replacements to be made.

text: '(Hello World)] *!'
using_replace                     0.046
using_join                        0.016
using_translate                   0.031

text: '~this@is!a^test@'
using_replace                     0.046
using_join                        0.017
using_translate                   0.029

text: '~/()&this@isasd!&=)(/as/dw&%#a^test@' * 1000
using_replace                     0.195
using_join                        2.327
using_translate                   0.061

text: 'a long text without chars to replace' * 1000
using_replace                     0.051
using_join                        2.100
using_translate                   0.064

Comparing some strategies:

def using_replace(text, symbols_to_replace, replacement=' '):
    for char in symbols_to_replace:
        text = text.replace(char, replacement)

    return text

def using_join(text, symbols_to_replace, replacement=' '):
    return ''.join(
        replacement if char in symbols_to_replace else char
        for char in text)

def using_translate(text, symbols_to_replace, replacement=' '):
    translation_dict = str.maketrans(
        dict.fromkeys(symbols_to_replace, replacement))

    return text.translate(translation_dict)

with this timeit code for different texts:

    import timeit

    # a 'set' for faster lookup
    symbols = {
        '`', '~', '!', '@', '#', '$', '%', '^', '&', '*',
        '(', ')', '_', '-', '+', '=', '{', '[', ']', '}',
        '|', '/', ':', ';', '"', '<', ',', '>', '.', '?',
        '\\',
    }

    text_list = [
        '(Hello World)] *!',
        '~this@is!a^test@',
        '~/()&this@isasd!&=)(/as/dw&%#a^test@' * 1000,
        'a long text without chars to replace' * 1000,
    ]
    for s in text_list:
        assert (
                using_replace(s, symbols)
                == using_join(s, symbols)
                == using_translate(s, symbols))

    for s in text_list:
        print()
        print('text:', repr(s))
        for func in [using_replace, using_join, using_translate]:
            t = timeit.timeit(
                'func(s, symbols)',
                'from __main__ import func, s, symbols',
                number=10000)
            print('{:30s} {:8.3f}'.format(func.__name__, t))
Ralf
  • To note: `.translate()` shows linear time depending on the length of the string (`O(n)`). – Ralf May 30 '19 at 13:38
  • On the other hand, `.replace()` is fast if there are few replacements and slow if there are many replacements to be made. – Ralf May 30 '19 at 13:39
  • Looks interesting and comprehensive, thanks (upvote). So could you say what is the big-O computation complexity of these methods? – Outcast May 30 '19 at 13:42
  • `.translate()` is linear (but with a very small coefficient, smaller than 1) and depends on the length of the string and (to a lesser extent) on the size of the translation table. – Ralf May 30 '19 at 13:44
  • `.join()` and `replace()` are more difficult to pinpoint, but probably also linear (with coefficient 1). And they are also impacted a lot by how many chars need to be replaced, so it is very variable and NOT just a simple linear complexity. – Ralf May 30 '19 at 13:46
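One detail worth noting about the comparison above: using_translate rebuilds the translation table on every call, so its timings include that cost. A minimal sketch of building the table once and reusing it could look like this (the names `SPACE_TABLE`, `translate_prebuilt` and `translate_rebuilt` are hypothetical, and no timings are shown because they depend on the machine):

```python
import timeit

symbols = "`~!@#$%^&*()_-+={[]}|\\:;\"<,>.?/"

# Built once at import time and reused for every call.
SPACE_TABLE = str.maketrans(dict.fromkeys(symbols, " "))


def translate_prebuilt(text):
    """Replace each symbol with a space using the shared table."""
    return text.translate(SPACE_TABLE)


def translate_rebuilt(text):
    """Same result, but the table is rebuilt on every call."""
    return text.translate(str.maketrans(dict.fromkeys(symbols, " ")))


text = "~this@is!a^test@"
assert translate_prebuilt(text) == translate_rebuilt(text) == " this is a test "

for func in (translate_prebuilt, translate_rebuilt):
    # Passing a callable avoids the "from __main__ import ..." setup string.
    elapsed = timeit.timeit(lambda: func(text), number=10000)
    print("{:20s} {:8.4f}".format(func.__name__, elapsed))
```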
0

str.translate() is indeed the fastest method. Here's a concise way to build the translation table for exclusion of characters:

symbols = ["`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+", "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"]
removeSymbols = str.maketrans("","","".join(symbols))

cleanText = "[Hello World] *!".translate(removeSymbols)
print(cleanText) # "Hello World "

The maketrans() function can take three parameters: the first is a string with the characters to replace, the second is a string of the same length with their replacements, and the third is a string of characters that should be removed. To simply remove characters, we only need to supply the third parameter with a string containing the symbols to remove, leaving the first two empty.

The translation table removeSymbols then performs a complete removal of the characters in the symbols list.

To replace with spaces, build the translation table like this:

removeSymbols = str.maketrans("".join(symbols)," "*len(symbols))
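The two-argument and three-argument forms can also be combined in a single table. As a sketch (the choice of which characters to replace and which to delete is purely illustrative):

```python
# First two arguments: equal-length strings mapped position by position.
# Third argument: characters mapped to None, i.e. deleted.
mixed = str.maketrans("_-", "  ", "!?*")

result = "snake_case-name!?*".translate(mixed)
print(repr(result))   # 'snake case name'

# The table is a plain dict keyed by code points.
print(mixed[ord("_")], mixed[ord("!")])   # 32 None
```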
Alain T.
  • Hm looks interesting and it works (upvote). But I would like to understand a bit what it does. What is the point of `*3`? Also do you replace these symbols with whitespace or you add nothing there? – Outcast May 30 '19 at 15:20
  • @Poete Maudit, this was actually more complicated than it needed to be. I didn't have to supply the symbols string 3 times. It was enough to only use it for the 3rd parameter of maketrans(). I adjusted the example. – Alain T. May 30 '19 at 15:29
  • Yes, actually I think that in this respect @yatu's answer above is the most concise one. – Outcast May 30 '19 at 15:36
0

While Roght's answer is the best IMO and shows an objective approach, I'd like to note that translate is not always the fastest! You really need to measure it yourself; the result will depend on your inputs.

A bit of complexity theory

(disclaimer: I haven't looked into Python source code, so below is what I would expect), given we have K symbols to replace and N symbols in source string:

str.replace should basically iterate over the whole string, checking every symbol and replacing it if it matches the parameter. That looks like pure O(N), thus for K replacements it will be O(K*N).

On the other hand, translate should iterate over the whole string just once, checking every symbol for a match in the translation table. As the translation table is a hash map, lookup there is O(1), thus the whole translation doesn't depend on K at all and should be O(N).

Question - why is then replace faster in my case??? I don't know :(


I came across this when I was refactoring our script that analyzes test logs (quite big files, think 60 MB+). It was cleaning them of some random symbols as well as doing some HTML sanitization. Here is the replacement dictionary:

replace_dict = {
        "&": "&amp;",
        "\"": "&quot;",
        "<": "&lt;",
        ">": "&gt;",
        "\u0000": "",
        "\u0007": "",
        "\u0008": "",
        "\u001a": "",
        "\u001b": "",
    }

When I saw the initial code with 9 str.replace calls in a row, my first thought was "wtf, let's use translate instead, this must be much faster". However, in my case I found out that replace is actually the fastest method.

Test script:

replace_dict = {
    "&": "&amp;",
    "\"": "&quot;",
    "<": "&lt;",
    ">": "&gt;",
    "\u0000": "",
    "\u0007": "",
    "\u0008": "",
    "\u001a": "",
    "\u001b": "",
}

symbols = list(replace_dict.keys())
translate_table = {ord(k): v if v else None for k, v in replace_dict.items()}
with open("myhuge.log") as f:
    big_document = f.read()


def func_replace(doc):
    for k, v in replace_dict.items():
        doc = doc.replace(k, v)
    return doc


def func_trans(doc):
    return doc.translate(translate_table)


def func_list_comp(doc):
    # That's not really equivalent to two methods above, but still good for perf comparison
    return "".join(c for c in doc if c not in symbols)


if __name__ == '__main__':
    import timeit
    number = 5
    print("func_replace(big_document): ", timeit.timeit("func_replace(big_document)",
          setup="from __main__ import func_replace, big_document", number=number))

    print("func_trans(big_document): ", timeit.timeit("func_trans(big_document)",
          setup="from __main__ import func_trans, big_document", number=number))

    print("func_list_comp(big_document): ", timeit.timeit("func_list_comp(big_document)",
          setup="from __main__ import func_list_comp, big_document", number=number))

So here are the results:

func_replace(big_document): 4.945449151098728

func_trans(big_document): 15.22288554534316

func_list_comp(big_document): 45.01621600985527

I can make two conclusions out of it:

  • List comprehension is really slow, don't use it.
  • Counter-intuitively, replace can be a few times faster than translate in some cases. If your replacement table is not too big and the strings you're working on are very big, it seems replace would be better.
The Godfather
  • `symbols = list(replace_dict.keys())` => `symbols = set(replace_dict)` – Olvin Roght Feb 03 '22 at 12:07
  • You can try also read file as binary (`"rb"`) and use `bytes.translate()` which might be significantly faster. – Olvin Roght Feb 03 '22 at 12:15
  • @OlvinRoght I'm not sure I get how I can use `bytes.translate` for this case as it seems it works only for single-char replacements, not for string replacements (and e.g. `&quot;` is a 6-char string). And about changing `\u..` chars - does it actually make any difference? I feel like `\u`-format is looking more consistent and readable... – The Godfather Feb 03 '22 at 12:34
  • I found option with `\xFF` more readable, okay. But recommendation from first comment still important. Also, change `"".join(...)` to `"".join([...])`, it will also boost last method a bit. – Olvin Roght Feb 03 '22 at 13:03
  • @OlvinRoght but it doesn't matter at all, does it? We're not optimizing test program itself but the actual replace methods. This is general code being executed for all test methods in `setup`, so even if I add `sleep(1)` at the top of the file, it won't change anything. – The Godfather Feb 03 '22 at 13:09
  • `func_list_comp` uses `symbols` and it will demonstrate slightly better performance if `symbols` would be initialized as `set`. – Olvin Roght Feb 03 '22 at 13:11
  • Okay, I found out what happens. You will see it too if you use the following translate table: `{ord("&"): "a", ord('"'): "q", ord("<"): "l", ord(">"): "g", 0: None, 7: None, 8: None, 0x1a: None, 0x1b: None}`. There are two translate functions: [`unicode_fast_translate()`](https://github.com/python/cpython/blob/f4c03484da59049eb62a9bf7777b963e2267d187/Objects/unicodeobject.c#L9176) and [`_PyUnicode_TranslateCharmap()`](https://github.com/python/cpython/blob/f4c03484da59049eb62a9bf7777b963e2267d187/Objects/unicodeobject.c#L9227), which calls the first one if certain conditions are met. – Olvin Roght Feb 03 '22 at 14:57
  • Basically, if your replacement string is not exactly 1 character long, `str.translate` will proceed with the more complicated method, which is significantly slower. You can dig into the sources and make a comparison with the [string replacement functions](https://github.com/python/cpython/blob/fb44d0589615590b1e7895ba78a038e96b15a219/Objects/stringlib/transmogrify.h#L275) to include some explanations in your answer. – Olvin Roght Feb 03 '22 at 15:02
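The effect described in the last two comments can be checked with a sketch like the one below. It times str.translate() with a table whose replacements are all single ASCII characters against a table with multi-character replacements, and against chained str.replace(). The sample text and all names are made up for illustration, and no timings are shown because they depend on the interpreter version and the input.

```python
import timeit

# Synthetic stand-in for a large log file (an assumption for this sketch).
doc = 'plain text with <tags> & "quotes" \x00\x07 ' * 5000

single_char = {ord("&"): "a", ord('"'): "q", ord("<"): "l", ord(">"): "g",
               0: None, 7: None}
multi_char = {ord("&"): "&amp;", ord('"'): "&quot;", ord("<"): "&lt;",
              ord(">"): "&gt;", 0: None, 7: None}
replacements = {"&": "&amp;", '"': "&quot;", "<": "&lt;", ">": "&gt;",
                "\x00": "", "\x07": ""}


def with_replace(text):
    # "&" must be handled first so that entities inserted later are not
    # escaped again; dicts preserve insertion order in Python 3.7+.
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


# Both approaches must agree before their timings are worth comparing.
assert doc.translate(multi_char) == with_replace(doc)

for label, func in [
    ("translate, 1-char values", lambda: doc.translate(single_char)),
    ("translate, multi-char values", lambda: doc.translate(multi_char)),
    ("chained replace", lambda: with_replace(doc)),
]:
    print("{:30s} {:8.3f}".format(label, timeit.timeit(func, number=5)))
```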