I am writing a web crawler to learn Go.
My current implementation uses 10 goroutines to fetch websites, and I want to limit the number of times I hit any given hostname per second.
What is the best (thread-safe) approach to do this?
A channel is a thread-safe synchronization primitive you can coordinate goroutines with. You could pair a buffered channel with a time.Ticker
to allow only a given number of function calls per period.
// A PeriodicResource is a channel that is rebuffered periodically.
type PeriodicResource <-chan bool

// NewPeriodicResource provides a buffered channel that is refilled after the
// given duration. The size of the channel is given as count. This provides
// a way of limiting a function to count calls per duration.
func NewPeriodicResource(count int, reset time.Duration) PeriodicResource {
    ticker := time.NewTicker(reset)
    c := make(chan bool, count)
    go func() {
        for {
            // Await the periodic timer
            <-ticker.C
            // Top the buffer up to capacity
            for i := len(c); i < count; i++ {
                c <- true
            }
        }
    }()
    return c
}
A single goroutine waits for each ticker event and tops the buffered channel up to capacity. If consumers have not depleted the buffer, the next tick simply refills it. You can then receive from the channel to synchronously perform an action at most count times per duration. For example, I may want to call doSomething()
no more than five times per second:
r := NewPeriodicResource(5, time.Second)

for {
    // Attempt to dequeue from the PeriodicResource
    <-r

    // Each call synchronously draws from the periodic resource
    doSomething()
}
Naturally, the same channel could be used to call go doSomething()
instead, which would fan out at most five goroutines per second.
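For instance, a minimal sketch of that fan-out variant (doSomething stands in for your own fetch logic):

r := NewPeriodicResource(5, time.Second)

for {
    // Block until a slot is free for this period
    <-r

    // Launch the work concurrently; at most five goroutines start per second
    go doSomething()
}

Since you want to limit per hostname rather than globally, one possible approach is to keep one PeriodicResource per host in a map guarded by a sync.Mutex, so your 10 worker goroutines can look limiters up safely. This is just a sketch building on the type defined above; the names hostLimiters, limiterFor, and fetchURL are hypothetical, and it assumes imports of net/url, sync, and time alongside the PeriodicResource definition:

import (
    "net/url"
    "sync"
    "time"
)

// hostLimiters maps a hostname to its own rate limiter.
// The mutex makes lookups and insertions safe across goroutines.
var (
    mu           sync.Mutex
    hostLimiters = map[string]PeriodicResource{}
)

// limiterFor returns the limiter for host, creating it on first use.
func limiterFor(host string, perSecond int) PeriodicResource {
    mu.Lock()
    defer mu.Unlock()
    r, ok := hostLimiters[host]
    if !ok {
        r = NewPeriodicResource(perSecond, time.Second)
        hostLimiters[host] = r
    }
    return r
}

// fetchURL is what each crawler goroutine would call for a parsed URL.
// u.Hostname() identifies which limiter to wait on.
func fetchURL(u *url.URL) {
    // Block until this particular host has a free slot this second
    <-limiterFor(u.Hostname(), 5)

    // ... perform the HTTP GET here ...
}

Each hostname then gets its own independent budget per second, while the shared map stays safe to use from all of your worker goroutines.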