
I have an app where users can sign up and fill out a profile. This profile consists of 16 questions that can be answered using a slider. Each "answer" to a question can be an integer between -3 and 3 (or, shifted to be non-negative, between 0 and 6).

A user should be able to find similar users based on their answers to the questions. I thought a vector database like Weaviate or Pinecone could help me find these matches on demand, but unfortunately, in my simple experiments the similarity score is always 0.

Here is what I am doing in Pinecone:

Indexing:

const index = await initIndex()

const vectors = [
  {
    id: '1',
    values: [-3, -3, -3, -3, -3]
  },
  {
    id: '2',
    values: [-1, -1, -1, -1, -1]
  },
  {
    id: '3',
    values: [0, 0, 0, 0, 0]
  },
  {
    id: '4',
    values: [1, 1, 1, 1, 1]
  },
  {
    id: '5',
    values: [3, 3, 3, 3, 3]
  }
] as Vector[]

const upsertRequest: UpsertRequest = {
  vectors
}

await index.upsert({
  upsertRequest,
})

Searching:

const index = await initIndex()

const queryRequest = {
  topK: 10,
  vector: [0, 0, 0, 0, 0],
  includeValues: true
}

const queryResponse = await index.query({ queryRequest })

Result:

{
    "queryResponse": {
        "results": [],
        "matches": [
            {
                "id": "2",
                "score": 0,
                "values": [
                    -1,
                    -1,
                    -1,
                    -1,
                    -1
                ]
            },
            {
                "id": "1",
                "score": 0,
                "values": [
                    -3,
                    -3,
                    -3,
                    -3,
                    -3
                ]
            },
            {
                "id": "3",
                "score": 0,
                "values": [
                    0,
                    0,
                    0,
                    0,
                    0
                ]
            },
            {
                "id": "5",
                "score": 0,
                "values": [
                    3,
                    3,
                    3,
                    3,
                    3
                ]
            },
            {
                "id": "4",
                "score": 0,
                "values": [
                    1,
                    1,
                    1,
                    1,
                    1
                ]
            }
        ],
        "namespace": ""
    }
}

Why is the score always 0? Shouldn't it be .5 based on the vectors in my database?

desertnaut
Andre Zimpel
  • You wouldn't by chance happen to be using a cosine-based index, would you? If not, which index type are you using? – Aaron Aug 29 '23 at 17:59
  • Yeah, I was using cosine, but that doesn't seem to work since it only takes the direction into consideration and not the magnitude. So I opted for the Euclidean metric, which works. – Andre Zimpel Aug 30 '23 at 06:27

1 Answer


So it took a little bit of work, but I actually did manage to reproduce this. I created a COSINE-based index and added the data that you mentioned above. I then queried by a vector which matched ID#3:

{'matches': [
    {'id': '1', 'score': 0.0, 'values': [-3.0, -3.0, -3.0, -3.0, -3.0]},
    {'id': '5', 'score': 0.0, 'values': [3.0, 3.0, 3.0, 3.0, 3.0]},
    {'id': '3', 'score': 0.0, 'values': [0.0, 0.0, 0.0, 0.0, 0.0]},
    {'id': '2', 'score': 0.0, 'values': [-1.0, -1.0, -1.0, -1.0, -1.0]},
    {'id': '4', 'score': 0.0, 'values': [1.0, 1.0, 1.0, 1.0, 1.0]}],
     'namespace': ''}

As a DataStax employee, I tried this on Astra DB next:

CREATE TABLE users (
    user_id INT PRIMARY KEY,
    survey_vector VECTOR<Float,5>);

CREATE CUSTOM INDEX users ON users(survey_vector) USING 'StorageAttachedIndex';

INSERT INTO users (user_id, survey_vector) VALUES (1,[-3, -3, -3, -3, -3]);
INSERT INTO users (user_id, survey_vector) VALUES (2,[-1, -1, -1, -1, -1]);
INSERT INTO users (user_id, survey_vector) VALUES (3,[0, 0, 0, 0, 0]);
INSERT INTO users (user_id, survey_vector) VALUES (4,[1, 1, 1, 1, 1]);
INSERT INTO users (user_id, survey_vector) VALUES (5,[3, 3, 3, 3, 3]);

It failed on the INSERT for user_id = 3, the all-zero vector:

WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed - received 0 responses and 3 failures: UNKNOWN from 10.16.22.38:7000, UNKNOWN from 10.16.12.4:7000, UNKNOWN from 10.16.8.4:7000" info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0, 'failures': 3}

Astra DB threw a similar error when I tried an ANN query.

TL;DR

You can't run a cosine-based vector search with a vector full of zeros (a zero vector, sometimes called a null vector): cosine similarity divides by the vectors' magnitudes, and a zero vector has magnitude 0, so the result is a divide-by-zero. Astra DB correctly threw an error (a consistency error, but an error nonetheless).
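To make the divide-by-zero concrete, here is a minimal sketch of plain cosine similarity in TypeScript. The helper name `cosineSimilarity` is made up for illustration; it is not how Pinecone or Astra DB compute scores internally, only the textbook formula.

```typescript
// Textbook cosine similarity: dot(a, b) / (|a| * |b|).
// Hypothetical helper for illustration only.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  // If either vector is all zeros, this is 0 / 0, which is NaN in JavaScript.
  return dot / (normA * normB);
}

const zeroVector = [0, 0, 0, 0, 0];

// Undefined for the zero vector:
console.log(cosineSimilarity(zeroVector, [1, 1, 1, 1, 1])); // NaN

// Magnitude is ignored: [1,...] and [3,...] point the same way,
// so their cosine similarity is approximately 1.
console.log(cosineSimilarity([1, 1, 1, 1, 1], [3, 3, 3, 3, 3]));
```

The second call also shows why cosine is a poor fit for slider answers: users who answered all 1s and all 3s look identical to it.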

Pinecone hides it. Not sure if it's silently failing, but it still gives you back the results. Although, it can't do anything about the score, so that's why they're all zeros.

Anyway, a search on a null vector does work with a Euclidean index/search. Recreate your index as "EUCLIDEAN," because you can have a null vector with that:

Pinecone with a Euclidean index:

{'matches': [
    {'id': '3', 'score': 0.0, 'values': [0.0, 0.0, 0.0, 0.0, 0.0]},
    {'id': '4', 'score': 5.0, 'values': [1.0, 1.0, 1.0, 1.0, 1.0]},
    {'id': '2', 'score': 5.0, 'values': [-1.0, -1.0, -1.0, -1.0, -1.0]},
    {'id': '1', 'score': 45.0,'values': [-3.0, -3.0, -3.0, -3.0, -3.0]},
    {'id': '5', 'score': 45.0,'values': [3.0, 3.0, 3.0, 3.0, 3.0]}],
     'namespace': ''}
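The Pinecone scores above are consistent with the squared Euclidean distance between the query and each stored vector (lower means closer). A minimal sketch that reproduces the numbers, assuming that interpretation; `squaredEuclidean` and `pineconeSample` are illustrative names, not part of any client library:

```typescript
// Squared Euclidean distance: sum of (a_i - b_i)^2.
// Hypothetical helper for illustration only.
function squaredEuclidean(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  return a.reduce((sum, x, i) => {
    const diff = x - (b[i] ?? 0);
    return sum + diff * diff;
  }, 0);
}

// The five vectors from the question.
const pineconeSample = [
  { id: "1", values: [-3, -3, -3, -3, -3] },
  { id: "2", values: [-1, -1, -1, -1, -1] },
  { id: "3", values: [0, 0, 0, 0, 0] },
  { id: "4", values: [1, 1, 1, 1, 1] },
  { id: "5", values: [3, 3, 3, 3, 3] },
];

const euclideanQuery = [0, 0, 0, 0, 0];

for (const { id, values } of pineconeSample) {
  console.log(id, squaredEuclidean(euclideanQuery, values));
}
// Prints: 1 45, 2 5, 3 0, 4 5, 5 45
```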

Astra DB with a Euclidean index:

> CREATE CUSTOM INDEX users_survey_vector_idx ON
  stackoverflow.users (survey_vector)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {'similarity_function': 'EUCLIDEAN'};

> SELECT user_id, similarity_euclidean(survey_vector,[0,0,0,0,0])
  AS similarity, survey_vector FROM users
  ORDER BY survey_vector
  ANN OF [0,0,0,0,0] LIMIT 5;

 user_id | similarity | survey_vector
---------+------------+----------------------
       3 |          1 |      [0, 0, 0, 0, 0]
       2 |   0.166667 | [-1, -1, -1, -1, -1]
       4 |   0.166667 |      [1, 1, 1, 1, 1]
       5 |   0.021739 |      [3, 3, 3, 3, 3]
       1 |   0.021739 | [-3, -3, -3, -3, -3]

(5 rows)
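The Astra DB similarity values above match 1 / (1 + d²), where d² is the squared Euclidean distance. That formula is inferred from the output here, so treat it as an assumption. It also means the score falls off non-linearly with distance. A small sketch, with hypothetical helper names:

```typescript
// Assumed mapping from squared Euclidean distance to a similarity in (0, 1]:
// similarity = 1 / (1 + d^2). Inferred from the query output, not confirmed.
function similarityFromEuclidean(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  const squaredDistance = a.reduce((sum, x, i) => {
    const diff = x - (b[i] ?? 0);
    return sum + diff * diff;
  }, 0);
  return 1 / (1 + squaredDistance);
}

const astraQuery = [0, 0, 0, 0, 0];

console.log(similarityFromEuclidean(astraQuery, [0, 0, 0, 0, 0]).toFixed(6)); // 1.000000
console.log(similarityFromEuclidean(astraQuery, [1, 1, 1, 1, 1]).toFixed(6)); // 0.166667
console.log(similarityFromEuclidean(astraQuery, [3, 3, 3, 3, 3]).toFixed(6)); // 0.021739
```

If a linear score is preferred, one option is to rank by the raw distance and rescale it in the application, rather than relying on the database's similarity value.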

Edit for Dot Product

Why is the score always 0? Shouldn't it be .5 based on the vectors in my database?

Made an edit to cover my bases, in the event that you were originally using a Dot Product based index. When running it with Pinecone, I get the same result as the above output with the Cosine-based index; same order, scores = zero.

However, with Astra DB:

> CREATE CUSTOM INDEX users_survey_vector_idx ON
  stackoverflow.users (survey_vector)
  USING 'StorageAttachedIndex'
  WITH OPTIONS = {'similarity_function': 'DOT_PRODUCT'};

> SELECT user_id, similarity_dot_product(survey_vector,[0,0,0,0,0])
  AS similarity, survey_vector FROM users
  ORDER BY survey_vector
  ANN OF [0,0,0,0,0] LIMIT 5;

 user_id | similarity | survey_vector
---------+------------+----------------------
       5 |        0.5 |      [3, 3, 3, 3, 3]
       1 |        0.5 | [-3, -3, -3, -3, -3]
       2 |        0.5 | [-1, -1, -1, -1, -1]
       4 |        0.5 |      [1, 1, 1, 1, 1]
       3 |        0.5 |      [0, 0, 0, 0, 0]

(5 rows)

Now, I'm not sure why Pinecone isn't computing scores for Dot Product. But Astra DB seems to handle this one with scores that match what you were expecting.
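One possible explanation, offered as an assumption rather than something verified against either product: the dot product of any vector with the zero vector is exactly 0, so a raw dot-product score of 0 for every row would be the correct arithmetic result. The 0.5 values from Astra DB would then fit a normalized score of (1 + dot) / 2. The sketch below shows both calculations with hypothetical helper names:

```typescript
// Raw dot product: sum of a_i * b_i. Hypothetical helper for illustration.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// Assumed normalization that would turn a raw dot product of 0 into 0.5.
function normalizedDotScore(a: number[], b: number[]): number {
  return (1 + dotProduct(a, b)) / 2;
}

const dotQuery = [0, 0, 0, 0, 0];
const dotSample = [
  [-3, -3, -3, -3, -3],
  [-1, -1, -1, -1, -1],
  [0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1],
  [3, 3, 3, 3, 3],
];

for (const values of dotSample) {
  // Every row has a raw dot product of 0 and a normalized score of 0.5.
  console.log(dotProduct(dotQuery, values), normalizedDotScore(dotQuery, values));
}
```

Either way, a zero query vector carries no information under dot product: every stored vector ties, so the ordering is arbitrary.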

Aaron
  • Man thank you for your comprehensive reply! I checked Pinecone and Weaviate and found out that both are behaving the same. I got the "best" results with a euclidean index. It actually gives you a comparable score, but it is not linear so I still gotta figure that out. But at least the vectors are compared as I desire. – Andre Zimpel Aug 30 '23 at 06:24
  • @AndreZimpel glad you got it working! I probably could have saved myself a lot of typing by just saying "try a Euclidean index." LOL – Aaron Aug 30 '23 at 12:57