I have a collection of 512D face embeddings, each stored as a std::vector. I create my index and train it on a subset of the data.
#include <faiss/IndexFlat.h>    // faiss::IndexFlatIP
#include <faiss/IndexIVFPQ.h>   // faiss::IndexIVFPQ
#include <cmath>
#include <vector>

int d = 512;                                // embedding dimensionality
size_t nb = this->templates.size();         // database size, ~95000
size_t nt = 50000;                          // training set size
faiss::IndexFlatIP coarse_quantizer(d);     // inner-product coarse quantizer
int ncentroids = int(4 * sqrt(nb));         // number of IVF cells
faiss::IndexIVFPQ index(&coarse_quantizer, d, ncentroids, 4, 8);  // PQ: 4 sub-quantizers, 8 bits each
std::vector<float> training_set(nt * d);    // flat row-major training buffer
this->templates is a map-like container: element.first holds an index/ID and element.second holds the 512D vector. My question is about the training and indexing. I have this currently:
int v = 0;
for (auto const& element : this->templates)
{
    if (v == (int)nt) break;                  // only the first nt templates go into the training buffer
    const std::vector<double>& enrollment_template = element.second;
    for (int i = 0; i < d; i++) {
        training_set[(v * d) + i] = (float)enrollment_template.at(i);  // narrow double -> float
    }
    v++;                                      // advance once per template, not once per component
}
index.train(nt, training_set.data());
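For reference, here is a minimal self-contained version of the same fill-and-train flow on synthetic data. The std::map type for templates and the random values are my stand-ins for the real enrollment data, and nt is scaled down so the sketch runs quickly:

#include <faiss/IndexFlat.h>
#include <faiss/IndexIVFPQ.h>
#include <cmath>
#include <map>
#include <random>
#include <vector>

int main() {
    const int d = 512;
    const size_t nt = 5000;                      // small training set for the sketch

    // stand-in for this->templates: id -> 512D embedding
    std::map<int, std::vector<double>> templates;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    for (int id = 0; id < (int)nt; id++) {
        std::vector<double> emb(d);
        for (int i = 0; i < d; i++) emb[i] = dist(rng);
        templates[id] = std::move(emb);
    }

    // flatten into a row-major float buffer, one embedding per row
    std::vector<float> training_set(nt * d);
    int v = 0;
    for (auto const& element : templates) {
        if (v == (int)nt) break;
        for (int i = 0; i < d; i++)
            training_set[(size_t)v * d + i] = (float)element.second[i];
        v++;
    }

    faiss::IndexFlatIP coarse_quantizer(d);
    int ncentroids = int(4 * sqrt((double)nt));
    faiss::IndexIVFPQ index(&coarse_quantizer, d, ncentroids, 4, 8);
    index.train(nt, training_set.data());
    return 0;
}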
The FAISS Index::train documentation says:

virtual void train(idx_t n, const float *x)

Perform training on a representative set of vectors.

Parameters:
n – number of training vectors
x – training vectors, size n * d
Is that the proper way of adding the 512D vector data into Faiss for training? It seems to me that if I have 2 face embeddings that are 512D in size, the training_set would be laid out like this:

training_set[0-511] - Face #1's 512D vector
training_set[512-1023] - Face #2's 512D vector

and since Faiss knows we are working with 512D vectors, it will parse them back out of the flat array, d values at a time.
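To check my own understanding of that layout, here is a tiny self-contained sketch with d = 4 standing in for 512 (the values are made up):

#include <cassert>
#include <vector>

int main() {
    const int d = 4;                      // tiny stand-in for 512
    std::vector<float> flat(2 * d);       // room for two vectors, back to back

    for (int i = 0; i < d; i++) {
        flat[0 * d + i] = 1.0f;           // "Face #1" occupies flat[0 .. d-1]
        flat[1 * d + i] = 2.0f;           // "Face #2" occupies flat[d .. 2d-1]
    }

    // component i of vector v lives at flat[v * d + i]
    assert(flat[0 * d + 3] == 1.0f);
    assert(flat[1 * d + 0] == 2.0f);
    return 0;
}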