61

Python has a built in function sum, which is effectively equivalent to:

def sum2(iterable, start=0):
    return start + reduce(operator.add, iterable)

for all types of parameters except strings. It works for numbers and lists, for example:

 sum([1,2,3], 0) = sum2([1,2,3],0) = 6    #Note: 0 is the default value for start, but I include it for clarity
 sum({888:1}, 0) = sum2({888:1},0) = 888

Why were strings specially left out?

 sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
 sum2(['foo','bar'], '') = 'foobar'

I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.

Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, and no banning was there for, say, lists.

Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?

Muhammad Alkarouri
  • 23,884
  • 19
  • 66
  • 101
  • Because it does not make sense to "sum" strings. – NullUserException Aug 19 '10 at 19:22
  • 19
    @NullUserException: it makes as much sense to "sum" strings as it is to "sum" lists. – Muhammad Alkarouri Aug 19 '10 at 19:31
  • 1
    @Muhammad Alkarouri: Summing a sequence sums the elements of the sequence. What are the elements of a string? Individual characters, which are -- still -- not numbers. You can't sum them. You could "concatenate" them, but -- interestingly -- they're already concatenated into the string. – S.Lott Aug 19 '10 at 19:52
  • 2
    @NullUserException: It would be great if you were right, but sadly the **+** operation on strings is already overloaded to mean concatentation. So with **+** we already construct string "sums". – u0b34a0f6ae Aug 19 '10 at 19:57
  • @kaizer I know that; I think this owes to the fact that most languages uses `+` as the concatenation operator. Still doesn't make sense to sum strings. – NullUserException Aug 19 '10 at 20:03
  • 3
    @S.Lott: I meant summing a sequence of lists compared to summing a sequence of strings. As it happens, "sum" of a list of lists concatenates the lists. You can sum two lists using `+` to concatenate them. You can sum two strings using `+` to concatenate them. So it makes as much sense to define sum as concatenation for strings as it is for lists. That is what I meant. Whether this is good or is bad is beside the question. – Muhammad Alkarouri Aug 19 '10 at 20:12
  • @Muhammad Alkarouri: "I meant summing a sequence of lists compared to summing a sequence of strings." That's doesn't seem to be clear in your question. – S.Lott Aug 19 '10 at 20:33
  • 2
    @S.Lott: read my question again. It is quite clear there. I said: "for all types of parameters except strings. It works for numbers and lists, for example." Which means that numbers and lists are parameters in much the same way strings are. How did you understand the comparison between `sum` and `"".join`? – Muhammad Alkarouri Aug 19 '10 at 20:57
  • @Muhammad Alkarouri: In your opinion it's quite clear. You wrote it. Of course you think it's clear. However, you have at least one person to whom it was not clear. You can argue with me about things that confused me, but it won't help. I **was** confused. I **did** ask. You **can** fix it or you can ignore me. But it doesn't make sense to argue too much. `sum("abc")` and `sum([1,2,3])` is what I understood you to be asking about. You can claim it's clear all you want, but it actually confused me and I actually asked. – S.Lott Aug 19 '10 at 21:25
  • 1
    @S.Lott: Of course I cannot objectively say what I write is clear, if there hadn't been multiple answers from people understanding the question correctly. I assumed that people aware with `sum` know about the string special case. Anyway, I have explained in that comment you replied to, and I have edited the question. Thanks for your input. – Muhammad Alkarouri Aug 19 '10 at 22:04
  • @Muhammad Alkarouri: " I assumed that people aware with sum know about the string special case." Evidence that you're smarter than some people reading your questions. Don't argue with people who are confused. Explain to people who are confused. – S.Lott Aug 20 '10 at 02:45
  • 1
    @S.Lott: I appreciate your advice. On the other hand, I didn't mean it as evidence that I am smarter than some people. I meant that people who care about this question either know in advance or are willing to put in the effort to lookup Python [sum](http://docs.python.org/library/functions.html#sum) which mentions this immediately. So it's not about being smart, it is about due diligence. I will, honestly, have your advice in mind for future situations. Thanks again. – Muhammad Alkarouri Aug 20 '10 at 03:08
  • @Muhammad Alkarouri: Here's what you're refusing to consider. Your question was -- clearly -- about `sum("abc")` and `sum([1,2,3])` and nothing more. You can claim that I *should* have done some mysterious *due diligence* to determine that your question wasn't trivially bad. I read a lot of bad SO questions. There is no magical *due diligence* for separating bad questions from good. I can only ask. You can revise your question to make it clear. When **one** person bothers to ask for clarification you have to take that as evidence that something's wrong with the question. – S.Lott Aug 20 '10 at 10:07
  • 3
    @S.Lott Not to beat a dead horse, but I read the question and understood it instantly. And on a more technical level, characters in a Python string are just strings themselves, you can technically /can/ sum the characters, resulting in ordinary concatenation. (`','.join('foo')`, for example, returns `'f,o,o'`.) – javawizard Dec 04 '12 at 01:34
  • I think dan04's answer, efficiency is the key here. – Dima Tisnek Apr 28 '14 at 16:48
  • Actually, `sum2` should be more like `return reduce(operator.add, iterable, start)`. [`reduce`](https://docs.python.org/2/library/functions.html#reduce) can also take an optional `start` parameter, and if omitted, unlike `sum`, would raise an Exception when given an empty sequence. – tobias_k Feb 06 '17 at 09:59

8 Answers8

49

Python tries to discourage you from "summing" strings. You're supposed to join them:

"".join(list_of_strings)

It's a lot faster, and uses much less memory.

A quick benchmark:

$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop

Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.

BTW, this restriction has been in place "forever", i.e., since the sum was added as a built-in function (rev. 32347)

rbp
  • 43,594
  • 3
  • 38
  • 31
  • The benchmark is useful. Your answer would be more complete if it compares the `reduce` vs `sum` for lists, which gives me around the same result. So the argument for strings would work only for strings. – Muhammad Alkarouri Aug 19 '10 at 19:38
  • 1
    Actually, I think it's the other way around: `reduce` and `sum` are similar for lists, because they do more or less the same thing. That's why I used this benchmark: `reduce`, representing `sum`, versus `join`, which is an optimised way of "adding" strings. – rbp Aug 19 '10 at 19:54
  • 1
    I am feeling old now. I understand "forever" in Python as before 2.0. Things done after the introduction of new style classes are not that "forever". – Muhammad Alkarouri Aug 19 '10 at 19:55
  • Hehe True :) I meant in terms of `sum`'s lifetime, r32347's commit message is "Adding new built-in function sum, with docs and tests." – rbp Aug 19 '10 at 20:00
  • 1
    It does seem "unpythonic" to discourage this usage. I agree with Muhammad that it's a strange exception. But it seems this is the best reason available. – Edmund Aug 20 '10 at 05:53
  • 2
    I think it's there because of the chronological sequence of implementations: when they implemented the `sum` builtin, they already had `join` for strings. So, to avoid people unknowingly (or knowingly, but lazily/mischievously) using `sum` for strings, they specifically disallowed it. Since there wasn't a specific "sum" for lists (or other types), they made the exception only for strings. Nowadays, I think they'd keep it this way, even if someone came up with a specific "sum", for backwards compatibility. – rbp Aug 23 '10 at 00:46
  • 1
    It's not optimizing for a common case. Optimization would be using that same if statement blocking it to instead invoke the (already written) `str.join` code. Since that would be such an easy optimization to make, I'd assume the intent is to deliberate fail noisily to teach people about the existence of `str.join`. – Ben Apr 30 '14 at 21:41
  • @Ben you're right, that's not an optimisation at all. I'm not sure why I wrote that :) – rbp May 01 '14 at 12:14
  • Makes you wonder why `sum` does not just delegate to `''.join` when `start` is a string, though, particularly since it already seems to have a special treatment for strings to produce that particular error message. – tobias_k Feb 06 '17 at 10:03
27

You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway..

>>> class ZeroObject(object):
...  def __add__(self, other):
...   return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
u0b34a0f6ae
  • 48,117
  • 14
  • 92
  • 101
17

Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup

In the builtin_sum function we have this bit of code:

     /* reject string values for 'start' parameter */
        if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
            PyErr_SetString(PyExc_TypeError,
                "sum() can't sum strings [use ''.join(seq) instead]");
            Py_DECREF(iter);
            return NULL;
        }
        Py_INCREF(result);
    }

So.. that's your answer.

It's explicitly checked in the code and rejected.

HS.
  • 15,442
  • 8
  • 42
  • 48
  • 5
    It's interesting to see the code, but the question was "*Why* aren't strings summed", not "*How* did they exclude strings from summing?" ... – Edmund Aug 20 '10 at 05:47
  • 4
    Well, I meant, the reason why its not working is because its explicitly banned in the code. Others seemed to answer by explaining why you shouldn't do it. – HS. Aug 20 '10 at 07:35
  • Thanks for sharing the source code. But I get a different traceback when I try to sum two strings. `sum(['foo', 'bar'])` -> `TypeError: unsupported operand type(s) for +: 'int' and 'str'`. But `sum(['foo', 'bar'], '')` -> `TypeError: sum() can't sum strings [use ''.join(seq) instead]`. Any ideas why the first one returns a different traceback? I think it's more likely to occur that way. – Bill Oct 21 '18 at 19:24
  • I just noticed the answer to my above comment is explained by @u0b34a0f6ae below. I now realise the `start=0` argument is the initial value for the summation process. – Bill Oct 21 '18 at 19:28
14

From the docs:

The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence).

By making sum refuse to operate on strings, Python has encouraged you to use the correct method.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • 1
    So it's part of the 'one way to do it' philosophy. (As they could just have made sum treat strings specially) – Thomas Ahle Feb 13 '14 at 13:43
  • 3
    @ThomasAhle: There is a computational reason why `sum` is the wrong tool for the job. `sum` builds the result incrementally. Doing so with strings requires (potentially many) temporary immutable strings. If done naively, this process requires lots of temporary memory and lots of copying. In contrast, if done with `''.join()`, the appropriate amount of memory is allocated from the beginning, and result is built right there at once. See [Martelli's exhortation](http://stackoverflow.com/questions/1349311/python-string-join-is-faster-than-but-whats-wrong-here/1350289#1350289). – unutbu Feb 13 '14 at 13:56
  • 2
    Sure, but you can do lots of inefficient things in Python, and often it doesn't matter. I can't think of any other cases than 'sum of strings' where there are tests against doing something, hard coded in the Python C code! – Thomas Ahle Feb 13 '14 at 14:39
11

Short answer: Efficiency.

Long answer: The sum function has to create an object for each partial sum.

Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.

doubles are always the same size, which makes sum's running time O(1)×N = O(N).

int (formerly known as long) is arbitary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).

For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).

Thus, summing strings would be much slower than summing numbers.

str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.

dan04
  • 87,747
  • 23
  • 163
  • 198
  • 4
    That argument is wrong because the same would hold for lists and tuples, which can be summed. As stated by HS, Python explicitly check for strings, and *only* for strings, which just doesn't make sense. – Philipp Aug 20 '10 at 06:08
  • 4
    @Philipp: The missing piece is that many, many people were trying to use `sum` for strings, and not many use `sum` for lists and tuples. The trap is that `sum` works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with `sum`. – Ethan Furman Apr 28 '14 at 20:01
  • 1
    Excellent answer. Except for "would be much slower" has to be "is much slower" as has been proven. – steffen Nov 25 '16 at 06:19
10

The Reason Why

@dan04 has an excellent explanation for the costs of using sum on large lists of strings.

The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples and other O(n**2) data structures. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.

Ethan Furman
  • 63,992
  • 20
  • 159
  • 237
  • 9
    Is there any reason why the `sum()` implementation doesn't just call `''.join()` when a string is encountered in `sum()` instead of returning an error instructing you to call `''.join()`? –  Apr 28 '14 at 20:13
  • @Slater: Yes. But I don't recall what it was. :( – Ethan Furman Apr 28 '14 at 22:06
4

Edit: Moved the parts about immutability to history.

Basically, its a question of preallocation. When you use a statement such as

sum(["a", "b", "c", ..., ])

and expect it to work similar to a reduce statement, the code generated looks something like

v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len("a") + len("b")
...
res = v10000 + "$" # must allocate res and set its size to len(v9999) + len("$")

In each of these steps a new string is created, which for one might give some copying overhead as the strings are getting longer and longer. But that’s maybe not the point here. What’s more important, is that every new string on each line must be allocated to it’s specific size (which. I don’t know it it must allocate in every iteration of the reduce statement, there might be some obvious heuristics to use and Python might allocate a bit more here and there for reuse – but at several points the new string will be large enough that this won’t help anymore and Python must allocate again, which is rather expensive.

A dedicated method like join, however has the job to figure out the real size of the string before it starts and would therefore in theory only allocate once, at the beginning and then just fill that new string, which is much cheaper than the other solution.

Debilski
  • 66,976
  • 12
  • 110
  • 133
  • 1
    It's true, but that's not the whole story. Integers are immutable as well. But the join operation on strings is specialised, takes the whole list into account, and, therefore, is much faster. – rbp Aug 19 '10 at 19:30
  • Yeah, ok maybe immutability is not the real problem here. (Though I can imagine that register-sized integers suffer less from immutability on the Python-side than strings do.) But then I think that preallocation actually *is* the whole story. – Debilski Aug 19 '10 at 19:37
  • 1
    @Debiliski: The preallocation was probably the missing link for me; it tells why `"",join` behaves so much better than a generic sum. I had to look at the [code](http://svn.python.org/projects/python/trunk/Objects/stringobject.c) to understand. I think the immutability is a red herring. – Muhammad Alkarouri Aug 19 '10 at 20:22
  • @Debiliski: Unfortunately I have accepted the other answer, else I would have suggested that you edit yours to make the preallocation and explanation more prominent. – Muhammad Alkarouri Aug 19 '10 at 20:23
  • Does it actually pre-allocate? Join's argument can be an arbitrary iterator, which you might only be able to traverse once. Without being familiar with the code, I'd guess the key point is at the C level you've got an over-allocated array which you can mutate in place; since it hasn't been released to python code yet this doesn't break the "python strings are immutable" law. The repeated allocations/copying can be amortized to O(n) if you always increase the size by at least double when you need more (rather than doing it for every string). – Ben Apr 30 '14 at 21:53
3

I dont know why, but this works!

import operator
def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)