6

In Python, I have a string which is a comma separated list of values. e.g. '5,2,7,8,3,4'

I need to add a new value onto the end and remove the first value,

e.g. '5,22,7,814,3,4' -> '22,7,814,3,4,1'

Currently, I do this as follows:

mystr = '5,22,7,814,3,4'
latestValue='1'
mylist = mystr.split(',')
mystr = ''
for i in range(len(mylist)-1):
   if i==0:
      mystr += mylist[i+1]
   if i>0:
      mystr += ','+mylist[i+1]

mystr += ','+latestValue

This runs millions of times in my code and I've identified it as a bottleneck, so I'm keen to optimize it to make it run faster.

What is the most efficient to do this (in terms of runtime)?

Pete W
  • 1,817
  • 3
  • 18
  • 20
  • 2
    Why are you storing the list as a string instead of a list? – Brad Mace Nov 06 '10 at 03:20
  • Unfortunately, I have to use a string as I'm fetching the string from a database and then reinserting the new string. – Pete W Nov 06 '10 at 03:23
  • If this is truly your bottleneck then you may have run across the rare application where you really would benefit from rewriting in another language. In any language with mutable strings you could keep appending `,latestValue` and offsetting the pointer to the start, only really copying every Nth time (based on how long a string buffer you choose). – Ben Jackson Nov 06 '10 at 06:30
  • Thanks everyone for the solutions! Very impressive! – Pete W Nov 06 '10 at 17:47

6 Answers6

9

Use this:

if mystr == '':
    mystr = latestValue
else:
    mystr = mystr[mystr.find(",")+1:] + "," + latestValue

This should be much faster than any solution which splits the list. It only finds the first occurrence of , and "removes" the beginning of the string. Also, if the list is empty, then mystr will be just latestValue (insignificant overhead added by this) -- thanks Paulo Scardine for pointing that out.

Gabi Purcaru
  • 30,940
  • 9
  • 79
  • 95
6
mystr = mystr.partition(",")[2]+","+latestValue

improvement suggested by Paulo to work if mystr has < 2 elements.
In the case of 0 elements, it does extend mystr to hold one element.

_,_,mystr = (mystr+','+latestValue).partition(',')

$ python -m timeit -s "mystr = '5,22,7,814,3,4';latestValue='1'" "mystr[mystr.find(',')+1:]+','+latestValue"
1000000 loops, best of 3: 0.847 usec per loop
$ python -m timeit -s "mystr = '5,22,7,814,3,4';latestValue='1'" "mystr = mystr.partition(',')[2]+','+latestValue"
1000000 loops, best of 3: 0.703 usec per loop
John La Rooy
  • 295,403
  • 53
  • 369
  • 502
  • +1 This is the fastest by a longshot and I did not know about `partition`. – aaronasterling Nov 06 '10 at 05:46
  • 1
    this code doesn't work for `len(mystr.split(",")) < 2` http://stackoverflow.com/questions/4111711/what-is-the-most-efficient-way-to-concatenate-two-strings-and-remove-everything-b/4112208#4112208 – jfs Nov 06 '10 at 09:45
  • Seems to work for len(mystr.split(",")) != 1, and works if you add the last element before: 'mystr+=','+latest; mystr = mystr.partition(",")[2]' – Paulo Scardine Nov 06 '10 at 17:53
  • @Paulo, and it can be made almost as fast by using tuple unpacking instead of indexing. – John La Rooy Nov 06 '10 at 23:02
6
_, sep, rest = mystr.partition(",")
mystr = rest + sep + latestValue

It also works without any changes if mystr is empty or a single item (without comma after it) due to str.partition returns empty sep if there is no sep in mystr.

You could use mystr.rstrip(",") before calling partition() if there might be a trailing comma in the mystr.

jfs
  • 399,953
  • 195
  • 994
  • 1,670
  • Very nice! If you *don't* have to handle the degenerate cases, this similar one-liner would work `mystr = ','.join([mystr.partition(",")[2],latestValue])`. – martineau Nov 06 '10 at 08:48
  • @martineau, `','.join()` is much less efficient than `mystr.partition(",")[2]+","+latestValue` – John La Rooy Nov 06 '10 at 09:56
  • @gnibbler: I don't see any difference in CPython (`.join()` is slower in jython, pypy) – jfs Nov 06 '10 at 10:15
  • @gnibbler: You're quite right. Somewhere along the line I was brainwashed into almost always avoiding string concatenation via '+' (and especially '+=') in my Python code -- and have fallen into the habit of doing so. Do you know of anything definitive on the subject (other than `timeit` tests ;-)? – martineau Nov 06 '10 at 10:27
  • @J.F. Sebastian: In timing tests I just ran with CPython 2.7, '+' was faster by nearly an order of magnitude (10x) when comparing `'a'+','+'b'` to `','.join(['a','b'])`. – martineau Nov 06 '10 at 10:36
  • @martineau, string concatenation has been improved in the (fairly common) case where there is only one reference to the string. See PyString_ConcatAndDel in http://svn.python.org/view/python/trunk/Objects/stringobject.c?view=markup – John La Rooy Nov 06 '10 at 11:30
  • @J.F. Sebastian: Yes they are, as it is with statistics. FWIW if I change the pair generator to `yield 'a'*i, 'bb'` in your benchmark, which I think is closer to what the OP might encounter, '+' is about 33% faster. That's a far cry from 10x, but still significant enough to consider when something will be done millions of times in code you're is trying to optimize. All of this has little to do with your answer which uses '+' and is clearly the best one because of its speed and elegant handling of the special cases. – martineau Nov 06 '10 at 20:34
  • This is still 10% slower than my answer (at least for strings of the size given in the question) – John La Rooy Nov 06 '10 at 23:06
  • @martineau: I've tried `'a'*i, `bb` variant the difference is `~5 %` not `33 %`. – jfs Nov 07 '10 at 07:29
  • @J.F. timeit as per my answer. The extra time is taken by binding+looking up the extra variables – John La Rooy Nov 07 '10 at 10:33
  • @J.F. Sebastian: Using a combination of the `cProfile` and `pstats` modules on 10 trials, I got an average percentage difference of the cumulative time in each method (`join_join` and `join_plus`) of slightly less than 7%, and a ratio of the two of ~1.32, which is why I said about 33% faster. – martineau Nov 08 '10 at 19:15
4

best version: gnibbler's answer


Since you need speed (millions of times is a lot), I profiled. This one is about twice as fast as splitting the list:

i = 0
while 1:
    if mystr[i] == ',': break
    i += 1
mystr = mystr[i+1:] + ', ' + latest_value

It assumes that there is one space after each comma. If that's a problem, you can use:

i = 0
while 1:
    if mystr[i] == ',': break
    i += 1
mystr = mystr[i+1:].strip() + ', ' + latest_value

which is only slightly slower than the original but much more robust. It's really up to you to decide how much speed you need to squeeze out of it. They both assume that there will be a comma in the string and will raise an IndexError if one fails to appear. The safe version is:

i = 0
while 1:
    try:
        if mystr[i] == ',': break
    except IndexError:
        i = -1
        break
    i += 1
mystr = mystr[i+1:].strip() + ', ' + latest_value

Again, this is still significantly faster than than splitting the string but does add robustness at the cost of speed.


Here's the timeit results. You can see that the fourth method is noticeably faster than the third (most robust) method, but slightly slower than the first two methods. It's the fastest of the two robust solutions though so unless you are sure that your strings will have commas in them (i.e. it would already be considered an error if they didn't) then I would use it anyway.

$ python -mtimeit -s'from strings import tests, method1' 'method1(tests[0], "10")' 1000000 loops, best of 3: 1.34 usec per loop

$ python -mtimeit -s'from strings import tests, method2' 'method2(tests[0], "10")' 1000000 loops, best of 3: 1.34 usec per loop

$ python -mtimeit -s'from strings import tests, method3' 'method3(tests[0], "10")' 1000000 loops, best of 3: 1.5 usec per loop

$ python -mtimeit -s'from strings import tests, method4' 'method4(tests[0], "10")' 1000000 loops, best of 3: 1.38 usec per loop


$ python -mtimeit -s'from strings import tests, method5' 'method5(tests[0], "10")' 100000 loops, best of 3: 1.18 usec per loop

This is gnibbler's answer

Community
  • 1
  • 1
aaronasterling
  • 68,820
  • 20
  • 127
  • 125
2
mylist = mystr.split(',')
mylist.append(latestValue);
mystr = ",".join(mylist[1:])

String concatenation in python isn't very efficient (since strings are immutable). It's easier to work with them as lists (and more efficient). Basically in your code you are copying your string over and over again each time you concatenate to it.

GWW
  • 43,129
  • 11
  • 115
  • 108
  • 1
    Don't forget that he also needs to remove the head of the list. What he really needs is a queue: http://docs.python.org/library/queue.html – Brian Clements Nov 06 '10 at 03:27
  • For further reading about performance of string vs. list concatenation, please take a look at ["What is the most efficient string concatenation method in python?"](http://stackoverflow.com/questions/1316887/what-is-the-most-efficient-string-concatenation-method-in-python). – rsenna Nov 06 '10 at 03:29
  • 1
    I remove the head of the list by using mylist[1:] in the join statement – GWW Nov 06 '10 at 03:39
  • I'm actually surprised that there's a difference between them. Perhaps the added step of converting the latestValue to a list before concatenating the lists is slower than a straight up append. – GWW Nov 06 '10 at 04:00
  • @GWW: right in the nail, "mystr=','.join(mystr.split(',')[1:])+','+latestValue" performs the same. – Paulo Scardine Nov 06 '10 at 04:13
  • @Paulo, I saw that I just don't understand why, it's not like our solutions are buggy and non-functional. – GWW Nov 06 '10 at 04:40
  • For the record, all three of my solutions beat this one in terms of speed. It's hardly the second best. Two of them beat the best answer for speed as well. – aaronasterling Nov 06 '10 at 04:47
  • @aaronsterling: I like your answer especially with the comparisons. – GWW Nov 06 '10 at 04:57
  • list has no advantage over string in this application as removing the first element is O(n) for the list – John La Rooy Nov 06 '10 at 05:18
  • If you look at my code I didn't remove the first element from the list. I used mylist[1:], which behaves as a generator. It's still faster than copying a string over and over again. However, may not be as fast as other solutions. – GWW Nov 06 '10 at 05:20
  • @GWW, `mylist[1:]` makes a new list by doing a shallow copy of the items of `mylist`. A generator version could use `itertools.islice` – John La Rooy Nov 06 '10 at 06:40
2

Edited: Not the best, but I love one-liners. :-)

mystr = ','.join(mystr.split(',')[1:]+[latestValue])

Before testing I would bet it would perform better.

> python -m timeit "mystr = '5,22,7,814,3,4'" "latestValue='1'" \
"mylist = mystr.split(',')" "mylist.append(latestValue);" \
"mystr = ','.join(mylist[1:])"
1000000 loops, best of 3: 1.37 usec per loop
> python -m timeit "mystr = '5,22,7,814,3,4'" "latestValue='1'"\
"','.join(mystr.split(',')[1:]+[latestValue])"
1000000 loops, best of 3: 1.5 usec per loop
> python -m timeit "mystr = '5,22,7,814,3,4'" "latestValue='1'"\
'mystr=mystr[mystr.find(",")+1:]+","+latestValue'
1000000 loops, best of 3: 0.625 usec per loop
Paulo Scardine
  • 73,447
  • 11
  • 124
  • 153