Calculate Eucledian Distance Between an Array with Arrays Database

Question

I want to make a feature matching algorithm which works by comparing detected object embedding with embeddings inside a database then search for the lowest distance. an embedding consist of 16 float32 values.

Currently there are 500 thousand embeddings in a database which all need to be compared with an input embedding each time a particular object detected (ex : car)

My current method is to load all database data to an Array in python and do iteration for each embedding to calculate lowest eucledian distance. It take tremendously long time if using loop (up to 1 second). I need about 30ms

Here is my code :

from sqlitedict import SqliteDict
import numpy as np

dictCarDb = SqliteDict('./carDatabase.db', autocommit=True)

"""
dictCarDb Dictionary Format :
dictCarDb = {
    "Car1" : [embedding1, embedding2, embedding3, ...],
    "Car2" : [embedding1, embedding2, embedding3, ...],
    "Car3" : [embedding1, embedding2, embedding3, ...],
    .....................
}

CONVERTED_DATA List Format :
CONVERTED_DATA = [
    ["Car1", [embedding1, embedding2, embedding3, ...]],
    ["Car2", [embedding1, embedding2, embedding3, ...]],
    ["Car3", [embedding1, embedding2, embedding3, ...]],
    .....................
]

embeddingInput List Format :
embeddingInput = [embedding1, embedding2, embedding3, ...]
"""

CONVERTED_DATA = []
for carName in dictCarDb:
    embedding = dictCarDb[carName]
    CONVERTED_DATA.append([carName, embedding])

def calculateDistance(embedding1, embedding2):
    dist = np.linalg.norm(embedding1 - embedding2)
    return dist

def searchCarFromArray(embeddingInput):
    smallestVal = ["Uknown", 1000]
    for carNameDb, carEmbeddingDb in CONVERTED_DATA:
        euclidVal = calculateDistance(carEmbeddingDb, embeddingInput)
        if euclidVal < smallestVal[1]:
            smallestVal = [carNameDb, euclidVal]
    if smallestVal[1] < 0.5:
        return smallestVal[0]
    return "Unknown"

=================== UPDATE (22/2/2022) ===================

I've tried suggestions from @DominiqueGarmier and @WillemHendriks below and find out that faiss (https://github.com/facebookresearch/faiss) is the solution.

below code can execute my goal in less than 30ms (it was 1 second if using python "for" loop) :

import faiss
.......
.......
def searchCarFromArray(embeddingInput):
    dimensionEmb = 16
    CONVERTED_DATA_NP = np.array(CONVERTED_DATA, dtype=object)
    embeddingsdB = CONVERTED_DATA_NP[:, 1].tolist()
    embeddingsdB_np = np.array(embeddingsdB, dtype=np.float32)
    embeddingInput_np = np.array(embeddingInput, dtype=np.float32)
    embeddingInput_np = embeddingInput_np.reshape(1, -1)
    index = faiss.IndexFlatL2(dimensionEmb)
    index.add(embeddingsdB_np)
    D, I = index.search(embeddingInput_np, 1)
    indexArr = I[0][0]
    carName = CONVERTED_DATA_NP[indexArr][0]
    carEmb = CONVERTED_DATA_NP[indexArr][1]
    euclidVal = D[0][0]
    if euclidVal < 0.5:
        return carName
    return "Unknown"

perhaps take a look at this post? it talks about finding the closest point in a list of points, which I think is what you are trying to do. https://stackoverflow.com/a/10818976/11578308 — Dominique Garmier, Feb 22 '22 at 09:43
https://github.com/facebookresearch/faiss There are specialized packages for this, like FAISS — Willem Hendriks, Feb 22 '22 at 10:39
@DominiqueGarmier Hi, thank you. I've tried it and got 63ms. I will post the result soon if I cant find faster solutions. — Halo Gass, Feb 22 '22 at 10:43
You could also tree BallTree, make sure to tweak the leafsize, as this can have (minor) impact. But if you fight against 50ms it might just do the trick. — Willem Hendriks, Feb 22 '22 at 12:09
@WillemHendriks Thank you, I tried faiss and finally got 30ms, god bless you — Halo Gass, Feb 22 '22 at 13:52

Calculate Eucledian Distance Between an Array with Arrays Database

0 Answers0