2

My objective is to take a map[string]int containing potentially up to a million entries and chunk it in sizes of up to 500 and POST the map to an external service. I'm newer to golang, so I'm tinkering in the Go Playground for now.

Any tips anyone has on how to improve the efficiency of my code base, please share!

Playground: https://play.golang.org/p/eJ4_Pd9X91c

The CLI output I'm seeing is:

original size 60
chunk bookends 0 20
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,
chunk bookends 20 40
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,
chunk bookends 40 60
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,

The problem here is that while the chunk bookends are being calculated correctly, the x value is starting at 0 each time. I think I should expect it to start at the chunk bookend minimum, which would be 0, 20, 40, etc. How come the range is starting at zero each time?

Source:

package main

import (
    "fmt"
    "math/rand"
    "strconv"
)

func main() {
    items := make(map[string]int)

    // Generate some fake data for our testing, in reality this could be 1m entries
    for i := 0; i < 60; i ++ {
        // int as strings are intentional here
        items[strconv.FormatInt(int64(rand.Int()), 10)] = rand.Int()
    }

    // Create a map of just keys so we can easily chunk based on the numeric keys
    i := 0
    keys := make([]string, len(items))
    for k := range items {
            keys[i] = k
            i++
    }

    fmt.Println("original size", len(keys))
    //batchContents := make(map[string]int)

    // Iterate numbers in the size batch we're looking for
    chunkSize := 20
    for chunkStart := 0; chunkStart < len(keys); chunkStart += chunkSize {
        chunkEnd := chunkStart + chunkSize

        if chunkEnd > len(items) {  
            chunkEnd = len(items)
        }

        // Iterate over the keys
        fmt.Println("chunk bookends", chunkStart, chunkEnd)
        for x := range keys[chunkStart:chunkEnd] {
            fmt.Print(x, ",")

            // Build the batch contents with the contents needed from items
            // @todo is there a more efficient approach?
            //batchContents[keys[i]] = items[keys[i]]
        }
        fmt.Println()

        // @todo POST final batch contents
        //fmt.Println(batchContents)
    }

}
icza
  • 389,944
  • 63
  • 907
  • 827
Ben
  • 60,438
  • 111
  • 314
  • 488

1 Answers1

6

When you process a chunk:

for x := range keys[chunkStart:chunkEnd] {}

You are iterating over a slice, and having one iteration variable, it will be the slice index, not the element from the slice (at the given index). Hence it will always start at 0. (When you iterate over a map, first iteration variable is the key because there is no index there, and the second is the value associated with that key.)

Instead you want this:

for _, key := range keys[chunkStart:chunkEnd] {}

Also note that it's redundant to first collect the keys in a slice, and then process them. You may do that when iterating over the map once, at first. Just keep a variable counting the iterations to know when you reach the chunk size, which may be implicit if you use data structures that keeps this (e.g. the size of a keys batch slice).

For example (try it on the Go Playground):

chunkSize := 20
batchKeys := make([]string, 0, chunkSize)
process := func() {
    fmt.Println("Batch keys:", batchKeys)
    batchKeys = batchKeys[:0]
}

for k := range items {
    batchKeys = append(batchKeys, k)
    if len(batchKeys) == chunkSize {
        process()
    }
}
// Process last, potentially incomplete batch
if len(batchKeys) > 0 {
    process()
}
icza
  • 389,944
  • 63
  • 907
  • 827
  • :facepalm: yes, `for _, key` makes complete sense. Thanks for pointing that out. I'm also super grateful for this more efficient approach. So `[:0]` is basically selecting no batch keys, thus returning an empty map, effectively clearing it out? – Ben Sep 10 '19 at 23:28
  • Bonus question: If it takes 30 seconds for `for k := range items {` to process these items, but throughout that 30 seconds the length of `items` changes and some of the `int` values are changed, will any errors be encountered? I'm fine not picking up on those updates... – Ben Sep 10 '19 at 23:35
  • @Webnet The batch keys are gathered in a _slice_ called `batchKeys`. `batchKeys = batchKeys[:0]` means to reslice this slice having a new length of `0`. This has the effect of "clearing" the slice (throwing away the previous batch's keys), and has the benefit of keeping the backing array, so it can be reused, no reallocation is required, no garbage is generated. – icza Sep 11 '19 at 07:19
  • @Webnet Bonus answer: if you add entries to the map while iterating over, that's not a problem, but the new elements may or may not be visited by the loop, see [How to add into map in range loop](https://stackoverflow.com/questions/55399243/how-to-add-into-map-in-range-loop/55399337#55399337). – icza Sep 11 '19 at 07:22
  • ... This applies if you add elements in the same loop. If you add elements in another goroutine, you must synchronize access to the map, else that's a race condition and the runtime will intentionally crash your app, see [How to recover from concurrent map writes?](https://stackoverflow.com/questions/39288741/how-to-recover-from-concurrent-map-writes/39289246#39289246) – icza Sep 11 '19 at 07:23
  • 1
    ... See this question how to apply proper synchronization if you need to add elements to a map in another goroutine while you iterate over it: [Concurrent access to maps with 'range' in Go](https://stackoverflow.com/questions/40442846/concurrent-access-to-maps-with-range-in-go/40456170#40456170). Although the easiest is to block modifications while you iterate over it (but that may be a problem to you if looping over it takes a long time). – icza Sep 11 '19 at 07:24