Goal: I'd like to distribute a couple of billion points into bins. Bins should be flushed to disk to keep the memory footprint sane.
My attempt: Whenever a bin reaches a threshold, e.g. 1 million points, I'd like to spawn a thread that writes the points to disk. Data is written to one file per bin. Multiple threads can be spawned for different bins, but at most one thread per bin. I'm doing this by checking a bool named "flushing". When a bin starts being flushed, the flag is set to true in the main thread, and it is set back to false by the write thread.
Question: Will this cause threading issues? My assumption is that there shouldn't be an issue, since "flushing" can only become true again once the previous write thread has done its job, at which point a new thread is allowed to spawn. It's okay if the bins become larger than 1 million points in the meantime.
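#include <thread>
#include <utility>
#include <vector>
using namespace std;

// Point, computeBinIndex() and saveToDisk() are defined elsewhere.

struct Bin{
    vector<Point> points;
    bool flushing = false; // set by the main thread, reset by the write thread
};

vector<Bin> bins;

void flush(Bin& bin);

void add(Point point){
    int index = computeBinIndex(point);
    Bin& bin = bins[index];
    bin.points.push_back(point);

    // only start flushing if bin.flushing == false
    if(bin.points.size() > 1'000'000 && bin.flushing == false){
        flush(bin);
    }
}

void flush(Bin& bin){
    // take ownership of the buffered points and leave the bin empty
    vector<Point> points = std::move(bin.points);
    bin.points = vector<Point>();
    bin.flushing = true;

    // move the points into the lambda (no copy) and capture the bin by
    // reference, so the write thread resets the flag on the real bin;
    // the reference stays valid as long as bins is not resized
    thread t([points = std::move(points), &bin](){
        saveToDisk(points);
        // we're done, set bin.flushing back to false
        bin.flushing = false;
    });
    t.detach();
}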
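
For comparison, here is a minimal sketch of the same scheme with the flag declared as std::atomic<bool>. As far as I understand it, a plain bool that one thread writes and another reads without synchronization is a data race in C++, and an atomic flag is one way to avoid that. Everything not in the code above is an assumption made only to keep the sketch self-contained: the Point layout, the pointsSaved counter, the dummy saveToDisk(), and passing the bin and the threshold to add() as parameters.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Placeholder point type (assumption; the real Point is not shown).
struct Point { double x, y, z; };

// Counts points handed to the dummy saveToDisk(); for demonstration only.
std::atomic<std::size_t> pointsSaved{0};

// Stand-in for the real file writer (assumption).
void saveToDisk(const std::vector<Point>& points) {
    pointsSaved += points.size();
}

struct Bin {
    std::vector<Point> points;          // touched by the main thread only
    std::atomic<bool> flushing{false};  // shared by main and write thread
};

// Hand the buffered points to a detached write thread.
void flush(Bin& bin) {
    assert(!bin.flushing.load());       // at most one write thread per bin
    std::vector<Point> points = std::move(bin.points);
    bin.points.clear();                 // reset moved-from vector to empty
    bin.flushing.store(true);           // set before the thread is started

    std::thread t([points = std::move(points), &bin]() {
        saveToDisk(points);
        bin.flushing.store(false);      // last access to bin by this thread
    });
    t.detach();
}

// Add a point; start a flush once the bin exceeds the threshold,
// unless a flush of this bin is still in progress.
void add(Bin& bin, const Point& point, std::size_t threshold) {
    bin.points.push_back(point);
    if (bin.points.size() > threshold && !bin.flushing.load()) {
        flush(bin);
    }
}
```

Two things to keep in mind with this sketch: std::atomic makes Bin non-copyable and non-movable, so the container of bins has to be sized up front (e.g. std::vector<Bin> bins(n)), and because the threads are detached, the program has to wait until every flag is false again before exiting, otherwise pending writes may be cut off.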