
I am used to coding in C/C++, and when I see the following array operation, I feel some CPU is being wasted:

version = '1.2.3.4.5-RC4'                 # the end can vary a lot
api = '.'.join( version.split('.')[0:3] ) # extract '1.2.3'

Therefore I wonder:

  • Will this line be executed (interpreted) as the creation of a temporary array (memory allocation), followed by a concatenation of the first three cells (another memory allocation)?
    Or is the Python interpreter smart enough to avoid this?
    (I am also curious about optimizations made in this context by Pythran, Parakeet, Numba, Cython, and other Python interpreters/compilers...)

  • Is there a trick to write a more CPU-efficient replacement line that is still understandable/elegant?
    (You can provide Python 2 and/or Python 3 specific tricks and tips)

oHo
    This is executed just once, isn't it? Why worry about the performance of this **at all**? No, Python won't optimise this; it'll create a list, then create another string. – Martijn Pieters Dec 02 '14 at 10:56
  • [Your code looks fine, get used to also feel fine](http://stackoverflow.com/a/27248431/2932052) – Wolf Dec 02 '14 at 11:44

3 Answers


I have no idea about the CPU usage here, but isn't this why we use high-level languages in the first place?

Another solution would be to use regular expressions; using a compiled pattern should allow internal optimisations:

import re
version = '1.2.3.4.5-RC4'
pat = re.compile(r'^(\d+\.\d+\.\d+)')  # raw string avoids invalid-escape warnings
res = pat.match(version)               # fixed: match via the compiled pattern
if res:
    print(res.group(1))

Edit: As suggested by @jonrsharpe, I also ran the timeit benchmark. Here are my results (api1 and api2 being the split-based variants):

def extract_vers(s):  # renamed parameter: 'str' shadowed the built-in
    res = pat.match(s)
    if res:
        return res.group(1)
    else:
        return False

>>> timeit.timeit("api1(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.9013631343841553
>>> timeit.timeit("api2(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.3482811450958252
>>> timeit.timeit("extract_vers(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.174590826034546

Edit: In any case, libraries already exist in Python for this job, such as distutils.version. You should have a look at that answer.

Aif
  • I would be surprised if a regular expression was noticeably faster than splitting. – Hannes Ovrén Dec 02 '14 at 10:58
  • Just checked, the split version is faster. 2.33 us vs 0.732 us. – Hannes Ovrén Dec 02 '14 at 11:03
  • @HannesOvrén it's curious, regarding my last edit, I really do not have the same results obviously. How did you compare? – Aif Dec 02 '14 at 11:16
  • I used the `%timeit` magic in IPython. I did however call the regex with `res = re.match(pat, version)` instead of `res = pat.match(version)`. The latter makes them more or less the same speed (884 ns). – Hannes Ovrén Dec 02 '14 at 11:58
  • Still, I'd say this is one of those times I realize that regexes are not always the best tool. It is not faster, and I think the split-version is a lot cleaner to read and understand. There are of course times when regexes are the correct choice, but I don't think this is one of them. – Hannes Ovrén Dec 02 '14 at 12:00
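The difference discussed in these comments can be reproduced as follows (a sketch; absolute timings vary by machine and Python version):

```python
import re
import timeit

pat = re.compile(r'^(\d+\.\d+\.\d+)')
s = '1.2.3.4.5-RC4'

# re.match(pat, s) goes through the module-level cache lookup on every
# call; pat.match(s) uses the compiled pattern object directly.
t_module = timeit.timeit("re.match(pat, s)",
                         globals={'re': re, 'pat': pat, 's': s},
                         number=100000)
t_compiled = timeit.timeit("pat.match(s)",
                           globals={'pat': pat, 's': s},
                           number=100000)
print(t_module, t_compiled)

# Both forms find the same prefix.
assert re.match(pat, s).group(1) == pat.match(s).group(1) == '1.2.3'
```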

To answer your first question: no, this will not be optimised out by the interpreter. Python will create a list from the string, then create a second list for the slice, then put the list items back together into a new string.
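Broken into explicit steps (with hypothetical intermediate names), the one-liner does roughly this:

```python
version = '1.2.3.4.5-RC4'

parts = version.split('.')  # first allocation: ['1', '2', '3', '4', '5-RC4']
head = parts[0:3]           # second allocation: the slice copies into a new list
api = '.'.join(head)        # third allocation: the result string '1.2.3'

assert api == '1.2.3'
```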

To cover the second, you can optimise this slightly by limiting the split with the optional maxsplit argument:

>>> v = '1.2.3.4.5-RC4'
>>> v.split(".", 3)
['1', '2', '3', '4.5-RC4']

Once the third '.' is found, Python stops searching through the string. You can also neaten the slice slightly by dropping the default 0 start index:

api = '.'.join(version.split('.', 3)[:3])

Note, however, that any difference in performance is negligible:

>>> import timeit
>>> def test1(version):
    return '.'.join(version.split('.')[0:3])

>>> def test2(version):
    return '.'.join(version.split('.', 3)[:3])

>>> timeit.timeit("test1(s)", setup="from __main__ import test1, test2; s = '1.2.3.4.5-RC4'")
1.0458565345561743
>>> timeit.timeit("test2(s)", setup="from __main__ import test1, test2; s = '1.2.3.4.5-RC4'")
1.0842980287537776

The benefit of maxsplit becomes clearer with longer strings containing many irrelevant '.' characters:

>>> timeit.timeit("s.split('.')", setup="s='1.'*100")
3.460900054011617
>>> timeit.timeit("s.split('.', 3)", setup="s='1.'*100")
0.5287887450379003
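To confirm that none of this is optimised away, you can disassemble the expression (a sketch; the exact bytecode differs between CPython versions):

```python
import dis

# Compile the one-liner and inspect the operations CPython will run:
# the split call, the slice, and the join call all appear as separate
# steps -- nothing is fused away by the interpreter.
code = compile("'.'.join(version.split('.')[0:3])", '<expr>', 'eval')
dis.dis(code)

names = [ins.argval for ins in dis.get_instructions(code)]
assert 'split' in names and 'join' in names
```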
jonrsharpe

I am used to coding in C/C++, and when I see the following array operation, I feel some CPU is being wasted:

Feeling that CPU cycles are being wasted is absolutely normal for C/C++ programmers facing Python code. Your code:

version = '1.2.3.4.5-RC4'                 # the end can vary a lot
api = '.'.join(version.split('.')[0:3])   # extract '1.2.3'

is absolutely fine in Python; there is no simplification possible. Only if you have to do this thousands of times should you consider using a library function or writing your own.
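If it ever does become a hot path, one way to "write your own" is to locate the third '.' directly and slice, avoiding the intermediate lists entirely (a hypothetical helper, not from the original answer):

```python
def api_prefix(version, n=3):
    """Return the version string up to its n-th '.', without building lists."""
    pos = -1
    for _ in range(n):
        pos = version.find('.', pos + 1)
        if pos == -1:          # fewer than n dots: return the whole string
            return version
    return version[:pos]

assert api_prefix('1.2.3.4.5-RC4') == '1.2.3'
assert api_prefix('1.2') == '1.2'
```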

Wolf