12

I'm stress testing (with loader.io) this type of code in Go to create an array of 100 items along with some other basic variables and parse them all in the template:

package main

import (
    "html/template"
    "net/http"
)

var templates map[string]*template.Template

// Load templates on program initialisation
func init() {
    if templates == nil {
        templates = make(map[string]*template.Template)
    }

    templates["index.html"] = template.Must(template.ParseFiles("index.html"))
}

func handler(w http.ResponseWriter, r *http.Request) {
    type Post struct {
        Id int
        Title, Content string
    }

    var Posts [100]Post

    // Fill posts
    for i := 0; i < 100; i++ {
        Posts[i] = Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}
    }

    type Page struct {
        Title, Subtitle string
        Posts [100]Post
    }

    var p Page

    p.Title = "Index Page of My Super Blog"
    p.Subtitle = "A blog about everything"
    p.Posts = Posts

    tmpl := templates["index.html"]

    tmpl.ExecuteTemplate(w, "index.html", p)
}

func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8888", nil)
}

My test with Loader is using 5k concurrent connections/s through 1 minute. The problem is, just after a few seconds after starting the test, I get a high average latency (almost 10s) and as a result 5k successful responses and the test stops because it reaches the Error Rate of 50% (timeouts).

On the same machine, PHP gives 50k+.

I understand that it's not Go performance issue, but probably something related to html/template. Go can easily manage hard enough calculations a lot faster than anything like PHP of course, but when it comes to parsing the data to the template, why is it so awful?

Any workarounds, or probably I'm just doing it wrong (I'm new to Go)?

P.S. Actually even with 1 item it's exactly the same... 5-6k and stopping after huge amount of timeouts. But that's probably because the array with posts is staying of the same length.

My template code (index.html):

{{ .Title }}
{{ .Subtitle }}

{{ range .Posts }}
        {{ .Title }}
        {{ .Content }}
{{ end }}

Here's the profiling result of github.com/pkg/profile:

root@Test:~# go tool pprof app /tmp/profile311243501/cpu.pprof
Possible precedence issue with control flow operator at /usr/lib/go/pkg/tool/linux_amd64/pprof line 3008.
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 2054 samples
      97   4.7%   4.7%      726  35.3% reflect.Value.call
      89   4.3%   9.1%      278  13.5% runtime.mallocgc
      85   4.1%  13.2%       86   4.2% syscall.Syscall
      66   3.2%  16.4%       75   3.7% runtime.MSpan_Sweep
      58   2.8%  19.2%     1842  89.7% text/template.(*state).walk
      54   2.6%  21.9%      928  45.2% text/template.(*state).evalCall
      51   2.5%  24.3%       53   2.6% settype
      47   2.3%  26.6%       47   2.3% runtime.stringiter2
      44   2.1%  28.8%      149   7.3% runtime.makeslice
      40   1.9%  30.7%      223  10.9% text/template.(*state).evalField

These are profiling results after refining the code (as suggested in the answer by icza):

root@Test:~# go tool pprof app /tmp/profile501566907/cpu.pprof
Possible precedence issue with control flow operator at /usr/lib/go/pkg/tool/linux_amd64/pprof line 3008.
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 2811 samples
     137   4.9%   4.9%      442  15.7% runtime.mallocgc
     126   4.5%   9.4%      999  35.5% reflect.Value.call
     113   4.0%  13.4%      115   4.1% syscall.Syscall
     110   3.9%  17.3%      122   4.3% runtime.MSpan_Sweep
     102   3.6%  20.9%     2561  91.1% text/template.(*state).walk
      74   2.6%  23.6%      337  12.0% text/template.(*state).evalField
      68   2.4%  26.0%       72   2.6% settype
      66   2.3%  28.3%     1279  45.5% text/template.(*state).evalCall
      65   2.3%  30.6%      226   8.0% runtime.makeslice
      57   2.0%  32.7%       57   2.0% runtime.stringiter2
(pprof)
uiwe83
  • 163
  • 1
  • 10
  • Can you check if Go 1.5 beta 1 (http://golang.org/dl/#go1.5beta1) changes anything? (in case garbage collector speed is an issue here) – VonC Jul 11 '15 at 20:58
  • Yes, I did. The results are the same. – uiwe83 Jul 11 '15 at 21:12
  • 1
    Look at profiling using https://github.com/pkg/profile and posting the top10 from the CPU profile. It'd also help to see whether your template does any work on the input? I'm assuming it just iterates over the input array. – elithrar Jul 12 '15 at 02:06
  • 1
    Passing a pointer to .Execute should speed it up a bit. – OneOfOne Jul 12 '15 at 02:42
  • @elithrar how do I read the file? I got /tmp/profile*****/cpu.pprof in the end, but it doesn't seem to be readable... – uiwe83 Jul 12 '15 at 04:06
  • @uiwe83 `go tool pprof $YOURBINARY cpu.pprof` and then type `top10` at the prompt and post the results in your question. I suspect a lot of makeSlice calls. – elithrar Jul 12 '15 at 04:09
  • How do you run your PHP app? – kostya Jul 14 '15 at 02:35
  • Also given that profiler shows 35.5% time spent in `reflect.Value.call` it would be interesting to see how the performance changes when `Post` and `Page` structs are replaces with maps (like in your PHP example). – kostya Jul 14 '15 at 02:40
  • @kostya I suppose you've already seen my code from reddit post so... ;) Yes I will definitely try to replace with maps, the reason I went for structs is just because I first noticed them and from the start liked to build them. – uiwe83 Jul 14 '15 at 06:40
  • @uiwe83, yes, I saw your reddit post, though I don't have an account there ;) Using structs would be my preference too as the code is cleaner and safer. Using structs is also much faster if all properties are resolved at compile time. I was just curious how much difference in performance would maps make. – kostya Jul 14 '15 at 08:24
  • @kostya I realize the PHP code is quite simple, tried to make it as similar on Go as possible and ended up with this code using maps (though in PHP I'm using array of arrays and they're not even associative, yet I couldn't recreate it in Go): http://pastebin.com/1ckqEDYZ - benchmarking it gives roughly 5k successful results, just like in my first try - even with text/template. – uiwe83 Jul 14 '15 at 23:51

6 Answers6

12

There are two main reasons why the equivalent application using html/template is slower than PHP variant.

First of all html/template provides more functionality than the PHP. The main difference is that html/template will automatically escape variables using correct escaping rules (HTML, JS, CSS, etc) depending on their location in the resulting HTML output (which I think is quite cool!).

Secondly html/template rendering code heavily uses reflection and methods with variable number of arguments and they are just not as fast as statically compiled code.

Under the hood the following template

{{ .Title }}
{{ .Subtitle }}

{{ range .Posts }}
    {{ .Title }}
    {{ .Content }}
{{ end }}

is converted to something like

{{ .Title | html_template_htmlescaper }}
{{ .Subtitle | html_template_htmlescaper }}

{{ range .Posts }}
    {{ .Title | html_template_htmlescaper }}
    {{ .Content | html_template_htmlescaper }}
{{ end }}

Calling html_template_htmlescaper using reflection in a loop kills performance.

Having said all that this micro-benchmark of html/template shouldn't be used to decide whether to use Go or not. Once you add code to work with the database to the request handler I suspect that template rendering time will hardly be noticeable.

Also I am pretty sure that over time both Go reflection and html/template package will become faster.

If in a real application you will find that html/template is a bottleneck it still possible to switch to text/template and supply it with already escaped data.

kostya
  • 9,221
  • 1
  • 29
  • 36
  • Thanks, I think both yours and @icza answers are great, though yours answers the question more clearly. One thing I left wondering - is it appropriate to try to use goroutines / channels and would it help if still using html/template? – uiwe83 Jul 14 '15 at 06:43
  • `http.ListenAndServer` creates one goroutine per request so rendering the same template for different connections happens in parallel. – kostya Jul 14 '15 at 06:56
  • I mean, will it be of any use if I use them for inside loops (where I create them, fill with data etc.)? And if so, would you provide an example based on my code? Sorry for the question off topic and asking too much! Right now I'm struggling with it but don't want to create a new question regarding this. This is my code so far: http://pastebin.com/9NBQ9xY8 – uiwe83 Jul 14 '15 at 07:35
  • I don't think it is possible to speed this up by using more goroutines. Think about it this way. Goroutines can only make some code faster if the code can be parallelized and executed on multiple cores simultaneously. In this case template rendering algorithm cannot be parallelized but more importantly all processor cores are already busy rendering templates for other connections and do not have capacity to do anything else. Generally if an app is IO bound adding more goroutines might help, if an app is CPU bound adding more goroutines might even slow the app down. – kostya Jul 14 '15 at 08:19
  • Interesting, I see. Interesting short overview btw, I've upgraded test machine to 2 cores and 4 gb of RAM, loaded with the same stress testing and resulted with 300 of 300k possible responses on php, while only twice more (I was using max procs) - around 120k, with Go. Go app is loading both cores on 100% each while processing... I'm not sure why... – uiwe83 Jul 14 '15 at 15:31
10

You are working with arrays and structs, both which are non-pointer types, nor are they descriptors (like slices or maps or channels). So passing them always creates a copy of the value, assigning an array value to a variable copies all the elements. This is slow and gives a huge amount of work to the GC.


Also you are utilizing only 1 CPU core. To utilize more, add this to your main() function:

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8888", nil))
}

Edit: This was only the case prior to Go 1.5. Since Go 1.5 runtime.NumCPU() is the default.


Your code

var Posts [100]Post

An array with space for 100 Posts is allocated.

Posts[i] = Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}

You create a Post value with a composite literal, then this value is copied into the ith element in the array. (redundant)

var p Page

This creates a variable of type Page. It is a struct, so its memory is allocated which also contains a field Posts [100]Post so another array of 100 elements is allocated.

p.Posts = Posts

This copies 100 elements (a hundred structs)!

tmpl.ExecuteTemplate(w, "index.html", p)

This creates a copy of p (which is of type Page), so another array of 100 posts is created and elements from p are copied, then it is passed to ExecuteTemplate().

And since Page.Posts is an array, most likely when it is processed (iterated over in the template engine), a copy will be made from each element (haven't checked - not verified).

Proposal for a more efficient code

Some things to speed up your code:

func handler(w http.ResponseWriter, r *http.Request) {
    type Post struct {
        Id int
        Title, Content string
    }

    Posts := make([]*Post, 100) // A slice of pointers

    // Fill posts
    for i := range Posts {
        // Initialize pointers: just copies the address of the created struct value
        Posts[i]= &Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}
    }

    type Page struct {
        Title, Subtitle string
        Posts []*Post // "Just" a slice type (it's a descriptor)
    }

    // Create a page, only the Posts slice descriptor is copied
    p := Page{"Index Page of My Super Blog", "A blog about everything", Posts}

    tmpl := templates["index.html"]

    // Only pass the address of p
    // Although since Page.Posts is now just a slice, passing by value would also be OK 
    tmpl.ExecuteTemplate(w, "index.html", &p)
}

Please test this code and report back your results.

icza
  • 389,944
  • 63
  • 907
  • 827
  • Thank you. This is very interesting (proper management of pointers, addresses and so on) and makes a lot of sense to me. Surprisingly, I've tested this code multiple times once again and still stuck with 5-6k barrier. I will update the question with the results of profiling tool. I'm utilizing only one core on my test machine (as it has only 1), but still using runtime.GOMAXPROCS. – uiwe83 Jul 12 '15 at 12:26
  • `Posts := make([]Post, 100)` (a slice of Posts) might be more efficient in this case. – kostya Jul 13 '15 at 01:09
  • 1
    @kostya It might. I was hesitating which to choose, because filling the array seems faster if elements are pointers, as in case of non-pointers, structs will be copied. Also when processing the slice, a simple `for range` would also copy each element into the "loop variable". Needs benchmarking. – icza Jul 13 '15 at 06:02
  • In go-nuts I've been advised to put variable creation and filling into init(), though that still didn't resolve a problem with ExecuteTemplate. The only way that for now really improves the performance (155k hits) is caching a template in bytes buffer at program init, but that means the data loaded on user's request is always the same. – uiwe83 Jul 13 '15 at 11:53
  • It's great that Go has a library for working with templates, looks like if I use some similar templating package in PHP (with those fancy security features), the results might be the same, not sure. Still in many cases I don't think I need them, because there won't be any user-generated content, why would I lose so much performance on simple {{ range }} of my blog posts. – uiwe83 Jul 13 '15 at 12:17
  • Can you do a quick test replacing `html/template` with `text/template` to measure performance penalty of HTML escaping? – kostya Jul 14 '15 at 02:31
  • @kostya yes I did, the performance improved greatly to the level of PHP, though I know this costs me losing benefits of automatic escaping. – uiwe83 Jul 14 '15 at 06:38
1

PHP isn't answering 5000 requests concurrently. The requests are being multiplexed to a handful of processes for serial execution. This makes more efficient use of both CPU and memory. 5000 concurrent connections may make sense for a message broker or similar, doing limited processing of small pieces of data, but it makes little sense for any service doing real I/O or processing. If your Go app is not behind a proxy of some type that will limit the number of concurrent requests, you will want to do so yourself, perhaps at the beginning of your handler, using a buffered channel or a wait group, a la https://blakemesdag.com/blog/2014/11/12/limiting-go-concurrency/.

1

html/template is slow because it uses reflection, which isn't optimized for speed yet.

Try quicktemplate as a workaround of slow html/template. Currently quicktemplate is more than 20x faster than html/template according to the benchmark from its' source code.

valyala
  • 11,669
  • 1
  • 59
  • 62
0

There is a template benchmark you can check at goTemplateBenchmark. Personally, I think Hero is the one that best combines efficiency and readability.

João Wiciuk
  • 170
  • 2
  • 11
0

Typed strings is your friend if you would like to speed up html/template. It is sometimes useful to pre-render repeating HTML-fragments.

Assuming that most of the time is spent to render those 100 Post objects, it could make sense to pre-render those.

Lars Christian Jensen
  • 1,407
  • 1
  • 13
  • 14