I'm writing a search engine in Go in which I have an inverted index of words to the corresponding results for each word. There is a set dictionary of words and so the words are already converted into a StemID
, which is an integer starting from 0. This allows me to use a slice of pointers (i.e. a sparse array
) to map each StemID
to the structure which contains the results of that query. E.g. var StemID_to_Index []*resultStruct
. If aardvark
is 0
then the pointer to the resultStruct for aardvark
is located at StemID_to_Index[0]
, which will be nil
if the result for this word is currently not loaded.
There is not enough memory on the server to store all of this in memory, so the structure for each StemID
will be saved as separate files and these can be loaded into the StemID_to_Index
slice. If StemID_to_Index
is currently nil
for this StemID
then the result is not cached and needs to be loaded, otherwise it's already loaded (cached) and so can be used directly. Each time a new result is loaded the memory usage is checked and if it's over the threshold then 2/3 of the loaded results are thrown away (StemID_to_Index
is set to nil
for these StemIDs and a garbage collection is forced.)
My problem is the concurrency. What is the fastest and most efficient way in which I can have multiple threads searching at the same time without having problems with different threads trying to read and write to the same place at the same time? I'm trying to avoid using mutexes on everything as that would slow down every single access attempt.
Do you think I would get away with loading the results from disk in the working thread and then delivering the pointer to this structure to an "updater" thread using channels, which then updates the nil
value in the StemID_to_Index
slice to the pointer of the loaded result? This would mean that two threads would never attempt to write at the same time, but what would happen if another thread tried to read from that exact index of StemID_to_Index
while the "updater" thread was updating the pointer? It doesn't matter if a thread is given a nil
pointer for a result which is currently being loaded, because it will just be loaded twice and while that is a waste of resources it would still deliver the same result and since that is unlikely to happen very often, it's forgiveable.
Additionally, how would the working thread which send the pointer to be updated to the "updater" thread know when the "updater" thread has finished updating the pointer in the slice? Should it just sleep and keep checking, or is there an easy way for the updater to send a message back to the specific thread which pushed to the channel?
UPDATE
I made a little test script to see what would happen if attempting to access a pointer at the same time as modifying it... it seems to always be OK. No errors. Am I missing something?
package main
import (
"fmt"
"sync"
)
type tester struct {
a uint
}
var things *tester
func updater() {
var a uint
for {
what := new(tester)
what.a = a
things = what
a++
}
}
func test() {
var t *tester
for {
t = things
if t != nil {
if t.a < 0 {
fmt.Println(`Error1`)
}
} else {
fmt.Println(`Error2`)
}
}
}
func main() {
var wg sync.WaitGroup
things = new(tester)
go test()
go test()
go test()
go test()
go test()
go test()
go updater()
go test()
go test()
go test()
go test()
go test()
wg.Add(1)
wg.Wait()
}
UPDATE 2
Taking this further, even if I read and write from multiple threads to the same variable at the same time... it makes no difference, still no errors:
From above:
func test() {
var a uint
var t *tester
for {
t = things
if t != nil {
if t.a < 0 {
fmt.Println(`Error1`)
}
} else {
fmt.Println(`Error2`)
}
what := new(tester)
what.a = a
things = what
a++
}
}
This implies I don't have to worry about concurrency at all... again: am I missing something here?