2

As far as I understood from this article, you can find nearest neighbors using <-> distance operator when working with geometric data types:

SELECT name, location --location is point
FROM geonames
ORDER BY location <-> '(29.9691,-95.6972)'
LIMIT 5;

You can also get some optimizations using SP-GiST indexes:

CREATE INDEX idx_spgist_geonames_location ON geonames USING spgist(location);

But I can't find anything about using <-> operator with arrays in the documentation. If I were to perform same queries using double precision[] instead of point, for example, would that work?

keddad
  • 1,398
  • 3
  • 14
  • 35

2 Answers2

0

Apparently, we can't. For example, I've got a simple table:

CREATE TABLE test (
  id SERIAL PRIMARY KEY,
  loc double precision[]
);

And I want to query documents from it, ordering by distance,

SELECT loc FROM test ORDER BY loc <-> ARRAY[0, 0, 0, 0]::double precision[];

It doesn't work:

 Query Error: error: operator does not exist: double precision[] <-> double precision[] 

Documentation has no mention of <-> for arrays as well. I found a workaround in accepted answer for this question, but it poses some limitations, especially on array's length. Although there is an article (written in Russian), which suggests a workaround on array size limitation. Creation of sample table:

import postgresql

def setup_db():
    db = postgresql.open('pq://user:pass@localhost:5434/db')
    db.execute("create extension if not exists cube;")
    db.execute("drop table if exists vectors")
    db.execute("create table vectors (id serial, file varchar, vec_low cube, vec_high cube);")
    db.execute("create index vectors_vec_idx on vectors (vec_low, vec_high);")

Element insertion:

query = "INSERT INTO vectors (file, vec_low, vec_high) VALUES ('{}', CUBE(array[{}]), CUBE(array[{}]))".format(
            file_name,
            ','.join(str(s) for s in encodings[0][0:64]),
            ','.join(str(s) for s in encodings[0][64:128]),
        )
db.execute(query)

Element querying:

import time
import postgresql
import random

db = postgresql.open('pq://user:pass@localhost:5434/db')

for i in range(100):
    t = time.time()
    encodings = [random.random() for i in range(128)]

    threshold = 0.6
    query = "SELECT file FROM vectors WHERE sqrt(power(CUBE(array[{}]) <-> vec_low, 2) + power(CUBE(array[{}]) <-> vec_high, 2)) <= {} ".format(
        ','.join(str(s) for s in encodings[0:64]),
        ','.join(str(s) for s in encodings[64:128]),
        threshold,
    ) + \
            "ORDER BY sqrt(power(CUBE(array[{}]) <-> vec_low, 2) + power(CUBE(array[{}]) <-> vec_high, 2)) ASC LIMIT 1".format(
                ','.join(str(s) for s in encodings[0:64]),
                ','.join(str(s) for s in encodings[64:128]),
            )
    print(db.query(query))
    print('inset time', time.time() - t, 'ind', i)
keddad
  • 1,398
  • 3
  • 14
  • 35
0

In the article linked in the OP, the author seems to be using PostGIS, an extension that does define a <-> operator to return the 2D distance between two geometries.

Documentation of the <-> operator: https://postgis.net/docs/geometry_distance_knn.html

swimmer
  • 1,971
  • 2
  • 17
  • 28