
I'm writing a simple caching mechanism which has an Add, an Evict, and a Search method. Search isn't implemented yet, so there's no need to worry about it.

There's a relatively large number of goroutines that call Add to add data, and only one that runs in an evict loop to evict data. As soon as I put some serious traffic on it, Go throws a fatal error saying there's concurrent read and write access on the map metricCache, but I can't see how that can happen because there are locks around it. I'm using Go 1.7.

File mdata/cache.go:

57: func NewCCache() *CCache {
58:     cc := &CCache{
59:         lock:        sync.RWMutex{},
60:         metricCache: make(map[string]*CCacheMetric),
61:         accnt:       accnt.NewFlatAccnt(maxSize),
62:     }
63:     go cc.evictLoop()
64:     return cc
65: }
66:
67: func (c *CCache) evictLoop() {
68:     evictQ := c.accnt.GetEvictQ()
69:     for target := range evictQ {
70:         c.evict(target)
71:     }
72: }
73: 
74: func (c *CCache) Add(metric string, prev uint32, itergen chunk.IterGen) {
75:     c.lock.Lock()
76: 
77:     if ccm, ok := c.metricCache[metric]; !ok {
78:         var ccm *CCacheMetric
79:         ccm = NewCCacheMetric()
80:         ccm.Init(prev, itergen)
81:         c.metricCache[metric] = ccm
82:     } else {
83:         ccm.Add(prev, itergen)
84:     }
85:     c.lock.Unlock()
86: 
87:     c.accnt.AddChunk(metric, itergen.Ts(), itergen.Size())
88: }
89: 
90: func (c *CCache) evict(target *accnt.EvictTarget) {
91:     c.lock.Lock()
92: 
93:     if _, ok := c.metricCache[target.Metric]; ok {
94:         log.Debug("cache: evicting chunk %d on metric %s\n", target.Ts, target.Metric)
95:         length := c.metricCache[target.Metric].Del(target.Ts)
96:         if length == 0 {
97:             delete(c.metricCache, target.Metric)
98:         }
99:     }
100: 
101:     c.lock.Unlock()
102: }

That's the error message:

metrictank_1    | fatal error: concurrent map read and map write
metrictank_1    | 
metrictank_1    | goroutine 3159 [running]:
metrictank_1    | runtime.throw(0xaade7e, 0x21)
metrictank_1    |       /usr/local/go/src/runtime/panic.go:566 +0x95 fp=0xc4216a7eb8 sp=0xc4216a7e98
metrictank_1    | runtime.mapaccess2_faststr(0x9e22c0, 0xc42031e600, 0xc4210c2b10, 0x22, 0x28, 0xa585d5496)
metrictank_1    |       /usr/local/go/src/runtime/hashmap_fast.go:306 +0x52b fp=0xc4216a7f18 sp=0xc4216a7eb8
metrictank_1    | github.com/raintank/metrictank/mdata/cache.(*CCache).Add(0xc4202fa070, 0xc4210c2b10, 0x22, 0x0, 0xc421875f82, 0x25, 0x25, 0xa585d5496)
metrictank_1    |       /home/mst/go/src/github.com/raintank/metrictank/mdata/cache/cache.go:77 +0x63 fp=0xc4216a7f80 sp=0xc4216a7f18
metrictank_1    | runtime.goexit()
metrictank_1    |       /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc4216a7f88 sp=0xc4216a7f80
metrictank_1    | created by github.com/raintank/metrictank/api.(*Server).getSeries
metrictank_1    |       /home/mst/go/src/github.com/raintank/metrictank/api/dataprocessor.go:442 +0x122b

UPDATE: I recompiled with -race and now I'm getting a different error. It looks as if the RWMutex were completely ineffective, because according to the backtraces the problem must be in the interaction between the evict and Add methods.

==================
WARNING: DATA RACE
Read at 0x00c4201c81e0 by goroutine 215:
  runtime.mapaccess2_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:297 +0x0
  github.com/raintank/metrictank/mdata/cache.(*CCache).Add()
      /home/mst/go/src/github.com/raintank/metrictank/mdata/cache/cache.go:77 +0xaa

Previous write at 0x00c4201c81e0 by goroutine 155:
  runtime.mapdelete()
      /usr/local/go/src/runtime/hashmap.go:558 +0x0
  github.com/raintank/metrictank/mdata/cache.(*CCache).evict()
      /home/mst/go/src/github.com/raintank/metrictank/mdata/cache/cache.go:97 +0x30e
  github.com/raintank/metrictank/mdata/cache.(*CCache).evictLoop()
      /home/mst/go/src/github.com/raintank/metrictank/mdata/cache/cache.go:70 +0xb3

Goroutine 215 (running) created at:
  github.com/raintank/metrictank/api.(*Server).getSeries()
      /home/mst/go/src/github.com/raintank/metrictank/api/dataprocessor.go:442 +0x17c9
  github.com/raintank/metrictank/api.(*Server).getTarget()
      /home/mst/go/src/github.com/raintank/metrictank/api/dataprocessor.go:331 +0x9c3
  github.com/raintank/metrictank/api.(*Server).getTargetsLocal.func1()
      /home/mst/go/src/github.com/raintank/metrictank/api/dataprocessor.go:284 +0xa9

Goroutine 155 (running) created at:
  github.com/raintank/metrictank/mdata/cache.NewCCache()
      /home/mst/go/src/github.com/raintank/metrictank/mdata/cache/cache.go:63 +0x12f
  main.main()
      /home/mst/go/src/github.com/raintank/metrictank/metrictank.go:388 +0x246c
==================
  • Please include where exactly the race detector shows the read and write access. – JimB Dec 23 '16 at 17:05
  • Yeah, you built with the `-race` flag, right? With that, Go will tell you right where the problem is. Without it, you may be hitting some sort of worst-case corrupted-memory detection error. – Zan Lynx Dec 23 '16 at 17:06
  • From this code I don't see any obvious problems. Except that this can't be all of the code, or why did you use an RWLock as only a simple Mutex? This code never read-locks it. I suspect a goroutine or other structure has, and is using, a pointer to your map. – Zan Lynx Dec 23 '16 at 17:21
  • It might be something sneaky like passing a CCache object into a goroutine or a defer function through a closure, so you don't see how it got outside of a lock at first sight. – Zan Lynx Dec 23 '16 at 17:23
  • http://stackoverflow.com/a/36167319/13422 – Zan Lynx Dec 23 '16 at 17:27
  • It does look like a problem with the lock. One thing that can cause this is if you **ever, anywhere, made a COPY of that sync.RWMutex**. Don't ever do that. – Zan Lynx Dec 23 '16 at 18:18
  • You may want to try declaring `lock` as a pointer to an RWMutex, which could solve a lock-copy problem. – Zan Lynx Dec 23 '16 at 18:20
  • @ZanLynx you were totally right, I created a copy of the lock – replay Dec 23 '16 at 18:23
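
Zan Lynx's lock-copy hypothesis can be checked mechanically: `go vet` has a copylocks check that flags any code copying a struct that contains a sync.Mutex or sync.RWMutex by value. A minimal sketch with hypothetical names (not the actual metrictank code) that triggers it:

    package main

    import "sync"

    type CCache struct {
        lock        sync.RWMutex
        metricCache map[string]int
    }

    // Receiving the cache by value copies the embedded RWMutex.
    // `go vet` reports something like:
    //   "process passes lock by value: main.CCache contains sync.RWMutex"
    func process(c CCache) {
        c.lock.Lock()
        defer c.lock.Unlock()
        c.metricCache["x"] = 1 // the map is shared; the lock is a private copy
    }

    func main() {
        c := CCache{metricCache: make(map[string]int)}
        process(c)
    }

Running `go vet ./...` as part of the build would catch this kind of copy before it ever reaches production traffic.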

1 Answer


A colleague of mine has found the answer:

After calling NewCCache() I copied the returned value (including the lock) and then called Add() on the copy, while the evictLoop() goroutine was still referring to the original. So they were operating on different copies of the lock :)
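
For anyone hitting the same symptom, here is a minimal sketch (hypothetical names, not the actual metrictank code) of that failure mode: copying the struct copies the RWMutex, but the map header inside it still points at the same underlying map, so the two goroutines lock different mutexes while mutating shared data:

    package main

    import "sync"

    type cache struct {
        lock sync.RWMutex
        data map[string]int
    }

    func newCache() *cache {
        return &cache{data: make(map[string]int)}
    }

    func main() {
        orig := newCache()
        copied := *orig // copies the RWMutex; `data` still refers to the same map

        var wg sync.WaitGroup
        wg.Add(2)

        go func() { // writes under the original's lock
            defer wg.Done()
            for i := 0; i < 10000; i++ {
                orig.lock.Lock()
                orig.data["k"] = i
                orig.lock.Unlock()
            }
        }()

        go func() { // writes under the copy's lock: a different mutex, same map
            defer wg.Done()
            for i := 0; i < 10000; i++ {
                copied.lock.Lock()
                copied.data["k"] = i
                copied.lock.Unlock()
            }
        }()

        wg.Wait() // build with -race and this reports the same kind of data race
    }

The fix is to keep passing the *CCache pointer everywhere (or, as Zan Lynx suggested, make the lock itself a pointer so that copies at least share it).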
