
I'm looking at using Go to write a small program that's mostly handling text. I'm pretty sure, based on what I've heard about Go and Python, that Go will be substantially faster. I don't actually have a specific need for insane speeds, but I'd like to get to know Go.

The "Go is going to be faster" idea was supported by a trivial test:

# test.py
print("Hello world")
$ time python test.py
Hello world

real    0m0.029s
user    0m0.019s
sys 0m0.010s

// test.go
package main

import "fmt"

func main() {

    fmt.Println("hello world")
}
$ time ./test
hello world

real    0m0.001s
user    0m0.001s
sys 0m0.000s

Looks good in terms of raw startup speed (which is entirely expected). Highly non-scientific justification:

$ strace python test.py 2>&1 | wc -l
1223
$ strace ./test 2>&1 | wc -l
174

However, my next contrived test was to see how fast Go is when faffing with strings, and I was expecting to be similarly blown away by Go's raw speed. So, this was surprising:

# test2.py
s = ""

for i in range(1000000):
    s += "a"
$ time python test2.py
real    0m0.179s
user    0m0.145s
sys 0m0.013s

// test2.go
package main

func main() {

    s := ""

    for i := 0; i < 1000000; i++ {
        s += "a"
    }
}
$ time ./test2
real    0m56.840s
user    1m50.836s
sys 0m17.653s

So Go is hundreds of times slower than Python.

Now, I know this is probably due to Schlemiel the Painter's algorithm, which explains why the Go implementation is quadratic in the number of iterations (making the loop 10 times longer leads to a 100-fold slowdown).

However, the Python implementation seems much faster: 10 times more loops only slows it down by a factor of two. The same effect persists if you concatenate str(i) instead, so I doubt there's some kind of magical JIT optimisation of s = 1000000 * 'a' going on. And it's not much slower if I print(s) at the end, so the variable isn't being optimised out.

Naivety of the concatenation methods aside (there are surely more idiomatic ways in each language), is there something here that I have misunderstood, or is it simply easier in Go than in Python to run into cases where you have to deal with C/C++-style algorithmic issues when handling strings (in which case a straight Go port might not be as uh-may-zing as I might hope without having to, ya'know, think about things and do my homework)?

Or have I run into a case where Python happens to work well, but falls apart under more complex use?

Versions used: Python 3.8.2, Go 1.14.2

Inductiveload
  • Take a look at [this question](https://stackoverflow.com/questions/1760757/how-to-efficiently-concatenate-strings-in-go) for details on more efficient string concatenation in Go. This type of benchmark tends to rely mostly on the allocation strategy for the underlying array(s). – Marc Apr 30 '20 at 09:00
  • Also Python's way should be: `''.join('a' for _ in range(n))` – norok2 Apr 30 '20 at 09:03
  • Go has counted strings (strings are implemented as a two-element header with length and data). The real heart of the problem is an implementation detail: Go never expands a string "in place" as it does not have a ref-counting garbage collector; Python over-allocates on purpose (to handle this kind of dumb thing) and does have a reference-counting garbage collector and can see that the refcount on the underlying string is 1 and expand the string in place. – torek Apr 30 '20 at 10:11
  • In essence, what you've demonstrated is that if you invoke the Python memory allocator roughly 20 times (log2(1000000)), it's slower than invoking the Go memory allocator roughly 1000000 times. – torek Apr 30 '20 at 10:14
  • @torek that's the perfect answer to the question and reveals an important implementation detail that could be critical if a Go port is to be of any benefit over a "naive" Python program. – Inductiveload Apr 30 '20 at 10:31

2 Answers


TL;DR summary: basically you're testing the two implementations' allocators / garbage collectors, and heavily weighting the scale on the Python side (by chance, as it were, but this is something the Python folks optimized at some point).

To expand my comments into a real answer:

  • Both Go and Python have counted strings, i.e., strings are implemented as a two-element header thingy containing a length (byte count or, for Python 3 strings, Unicode character count) and a data pointer.

  • Both Go and Python are garbage-collected (GCed) languages. That is, in both languages, you can allocate memory without having to worry about freeing it yourself: the system takes care of that automatically.

  • But the underlying implementations differ, quite a bit, in one particularly important way: the version of Python you are using has a reference-counting GC. The Go system you are using does not.

With a reference count, the inner bits of the Python string handler can do this. I'll express it as Go (or at least pseudo-Go), although the actual Python implementation is in C and I have not made all the details line up properly:

// add (append) new string t to existing string s
func add_to_string(s, t string_header) string_header {
    need := s.len + t.len
    if s.refcount == 1 { // can modify string in-place
        data := s.data
        if cap(data) >= need {
            copy_into(data + s.len, t.data, t.len)
            return s
        }
    }
    // s is shared or s.cap < need

    new_s := make_new_string(roundup(need))
    // important: new_s has extra space for the next call to add_to_string

    copy_into(new_s.data, s.data, s.len)
    copy_into(new_s.data + s.len, t.data, t.len)
    s.refcount--
    if s.refcount == 0 {
        gc_release_string(s)
    }
    return new_s
}

By over-allocating—rounding up the need value so that cap(new_s) is large—we get about log2(n) calls to the allocator, where n is the number of times you do s += "a". With n being 1000000 (one million), that's about 20 times that we actually have to invoke the make_new_string function and release (for gc purposes because the collector uses refcounts as a first pass) the old string s.

[Edit: your source archaeology led to commit 2c9c7a5f33d, which suggests less than doubling but still a multiplicative increase. To other readers: see the comments below.]
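To make that arithmetic concrete, here is a toy simulation (a sketch only, assuming simple doubling rather than CPython's actual growth policy, which per the comments below is closer to 12.5%) that counts how many reallocations geometric growth needs for a million one-byte appends:

package main

import "fmt"

func main() {
    // Simulate n one-byte appends against a capacity that grows
    // geometrically whenever it runs out, counting the reallocations.
    const n = 1000000
    length, capacity, allocs := 0, 0, 0
    for i := 0; i < n; i++ {
        length++
        if length > capacity {
            capacity = capacity*2 + 1 // double (plus one to get off zero)
            allocs++
        }
    }
    fmt.Println(allocs) // 20, since 2^20 - 1 >= 1000000
}

With a 1.125 growth factor instead, the count rises to roughly 117, as computed in the comments below, but that is still nothing like one million.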

The current Go implementation allocates strings without a separate capacity header field (see reflect.StringHeader and note the big caveat that says "don't depend on this, it might be different in future implementations"). Between the lack of a refcount—we can't tell, in the runtime routine that adds two strings, that the target has only one reference—and the inability to observe the equivalent of cap(s) (or cap(s.data)), the Go runtime has to create a new string every time. That's one million memory allocations.
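One way to convince yourself that the missing capacity is the real issue, and not anything else about Go: a Go []byte slice, unlike a Go string, does carry a capacity, and the built-in append over-allocates geometrically, much as Python does above. A quick sketch (not timed here, but it should run in linear rather than quadratic time):

package main

import "fmt"

func main() {
    var buf []byte
    for i := 0; i < 1000000; i++ {
        buf = append(buf, 'a') // reallocates only when len would exceed cap
    }
    s := string(buf) // one final copy to get an immutable string
    fmt.Println(len(s))
}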

To show that the Python code really does use the refcount, take your original Python:

s = ""

for i in range(1000000):
    s += "a"

and add a second variable t like this:

s = ""
t = s

for i in range(1000000):
    s += "a"
    t = s

The difference in execution time is impressive:

$ time python test2.py
        0.68 real         0.65 user         0.03 sys
$ time python test3.py
       34.60 real        34.08 user         0.51 sys

The modified Python program still beats Go (1.13.5) on this same system:

$ time ./test2
       67.32 real       103.27 user        13.60 sys

and I have not poked any further into the details, but I suspect the Go GC is running more aggressively than the Python one. The Go GC is very different internally, requiring write barriers and occasional "stop the world" behavior (of all goroutines that are not doing the GC work). The refcounting nature of the Python GC allows it to never stop: even when the refcount is 2, reassigning s drops the old string's refcount to 1, and the next assignment to t drops it to zero, releasing the memory block for re-use on the next trip through the main loop. So it's probably picking up the same memory block over and over again.

(If my memory is correct, Python's "over-allocate strings and check the refcount to allow expand-in-place" trick was not in all versions of Python. It may have first been added around Python 2.4 or so. This memory is extremely vague and a quick Google search did not turn up any evidence one way or the other. [Edit: Python 2.7.4, apparently.])

torek
  • That's a really helpful insight. I was inspired to do some digging into CPython: it looks like `PyByteArray_Resize` will over-allocate up to 12.5% of the length at a time, otherwise it allocates the exact amount. This looks to have come in in 2008 (`2c9c7a5f33d`, in Python 2.7.4rc1) when `stringobject.c` was renamed to `bytearrayobject.c`. Assuming that's how it works, log1.125(1e6) is about 117 allocations. – Inductiveload Apr 30 '20 at 16:46
  • Ah, so deliberate over-allocation was a relatively late 2.7 addition, then. I was pretty sure it wasn't always this way. – torek Apr 30 '20 at 17:04

Well. You should never, ever use string concatenation in this way :-)

In Go, try `strings.Builder`:

package main

import (
    "strings"
)

func main() {

    var b1 strings.Builder

    for i := 0; i < 1000000; i++ {
        b1.WriteString("a")
    }
}
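If the final length is known in advance, as it is here, one further refinement (a sketch using the standard Grow method, not timed here) is to reserve the space up front, so the loop never reallocates at all:

package main

import (
    "fmt"
    "strings"
)

func main() {

    var b1 strings.Builder
    b1.Grow(1000000) // reserve all the space in a single allocation

    for i := 0; i < 1000000; i++ {
        b1.WriteString("a")
    }

    fmt.Println(len(b1.String()))
}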
Herbert Poul
  • well, this is nice but an explanation (and a link) for why "you should never use this" would make it much better! – riyaz-ali Apr 30 '20 at 10:15
  • strings.Builder should be used because it is much faster for the use case of appending strings repeatedly. – Alexander Sep 28 '21 at 07:50