
I have a collection of 512-D std::vectors that store face embeddings. I create my index and train it on a subset of the data.

int d = 512;

size_t nb = this->templates.size(); // 95000
size_t nt = 50000;                  // training data size

std::vector<float> training_set(nt * d);

faiss::IndexFlatIP coarse_quantizer(d);
int ncentroids = int(4 * sqrt(nb));
faiss::IndexIVFPQ index(&coarse_quantizer, d, ncentroids, 4, 8);

Each element of this->templates has an index value in [0] (.first) and the 512-D vector in [1] (.second). My question is about the training and indexing. I have this currently:

int v = 0;
for (auto const& element : this->templates)
{
    std::vector<double> enrollment_template = element.second;
    for (int i = 0; i < d; i++) {
        training_set[(v * d) + i] = (float)enrollment_template.at(i);
    }
    v++;
}

index.train(nt,training_set.data());

FAISS Index.Train function

virtual void train(idx_t n, const float *x)
Perform training on a representative set of vectors

Parameters:
n – nb of training vectors

x – training vectors, size n * d

Is that the proper way of adding the 512D vector data into Faiss for training? It seems to me that if I have 2 face embeddings that are 512D in size, the training_set would be like this:

training_set[0-511] - Face #1's 512-D embedding
training_set[512-1023] - Face #2's 512-D embedding

and since Faiss knows we are working with 512-D vectors, it will slice each consecutive run of 512 floats out of the flat array.

Brian
  • Is it a typo in your question that `templates.size()` does not match `nt`? It seems like they should be the same. – John Zwinck Sep 18 '22 at 07:57
  • Hey John. With Faiss they don't recommend training on the entire database.. so nt is a subset of the data used for the initial training purposes – Brian Sep 18 '22 at 08:02
  • OK then `training_set[(v*d)+i]` will cause memory corruption because you're writing up to `templates.size()` rows but `training_set` only has space for `nt`. – John Zwinck Sep 18 '22 at 13:56
  • Very true.. and I have a check in there to cap the training set that I didn't include in this code snippet.. but the primary question really is about how Faiss accepts Vectors and if one face embedding is added via training_set[0-511] or if there is a different way. – Brian Sep 18 '22 at 14:13

1 Answer


Here's a more efficient way to write it:

int v = 0;
for (auto const& element : this->templates)
{
    auto& enrollment_template = element.second; // not copy
    if (v + d > training_set.size()) {
        break; // prevent overflow, "nt" is smaller than templates.size()
    }
    for (int i = 0; i < d; i++) {
        training_set[v] = enrollment_template[i]; // not at()
        v++;
    }
}

We avoid a copy with auto& enrollment_template, avoid extra branching with enrollment_template[i] (you know you won't be out of bounds), and simplify the address computation with training_set[v] by making v a count of elements rather than rows.

Further efficiency could be gained if templates can be changed to store floats rather than doubles; then you'd just be bitwise-copying 512 floats rather than converting doubles to floats.

Also, be sure to declare d as constexpr to give the compiler the best chance of optimizing the loop.

John Zwinck
  • Thanks John. I should have listed this in the original code (I updated it), but training_set is a std::vector&lt;float&gt;, and enrollment_template is one of an array of 512-D vectors. That's what is throwing me off. It seems odd that I would create 512 individual floats for each vector in the array – Brian Sep 18 '22 at 08:16
  • Yes it's correct, you pass a single array to Index.train(). – John Zwinck Sep 19 '22 at 16:34