I have a Weaviate instance running (v1.12.2) and I am experimenting with the Python client https://weaviate-python-client.readthedocs.io/en/stable/ (v3.4.2): adding, retrieving, and deleting objects, and so on.
I am trying to understand how filtered vector search works (outlined here https://weaviate.io/developers/weaviate/current/architecture/prefiltering.html#recall-on-pre-filtered-searches)
When applying pre-filtering, an 'allow-list' of object ids is constructed before carrying out vector search. This is done by using some property to filter out objects.
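To check that I have the idea right, here is a toy sketch in plain Python of how I picture pre-filtering. It is only my mental model, not Weaviate's actual implementation: the names `build_allow_list` and `prefiltered_search`, the in-memory `objects` dict, and the sample data are all made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_allow_list(objects, prop, value):
    """Step 1: collect the ids of objects whose property equals the value."""
    return {obj_id for obj_id, obj in objects.items() if obj[prop] == value}

def prefiltered_search(objects, query, prop, value, limit=100):
    """Step 2: score only the allow-listed objects and return the top ids."""
    allow = build_allow_list(objects, prop, value)
    scored = [(cosine(objects[i]["vector"], query), i) for i in allow]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:limit]]

# Hypothetical toy data: two images for "billy", one for "alice".
objects = {
    1: {"user": "billy", "vector": [1.0, 0.0]},
    2: {"user": "billy", "vector": [0.0, 1.0]},
    3: {"user": "alice", "vector": [1.0, 0.0]},
}
hits = prefiltered_search(objects, [1.0, 0.1], "user", "billy")
print(hits)
```

In this model, object 3 is never scored at all, even though its vector is close to the query, because it is not on the allow-list.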
For example the Where filter I'm using is:
where_filter_1 = {
    "path": ["user"],
    "operator": "Equal",
    "valueText": "billy"
}
This is because I've got many users whose data are kept in this DB and I would like for each user to be able to search their own data. In this case it is Image data.
This is how I implement this using the python client:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
    .with_where(where_filter_1)\
    .with_near_vector(nearVector)\
    .do()
I do not use any vectorization modules, so I create my own vector and pass it to the DB for vector search using .with_near_vector(nearVector), after applying the filter with .with_where(where_filter_1). This works as I expect, so I think I'm doing this correctly.
I'm less sure if I'm applying post-filtering correctly: Each image has some text attached to it. I use the Where filter to search through the text by using the inverted index structure.
where_filter_2 = {
    "path": ["image_text"],
    "operator": "Like",
    "valueText": "Paris France"
}
I apply post filtering like this:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
    .with_near_vector(nearVector)\
    .with_where(where_filter_2)\
    .do()
However, I don't think I'm doing this properly. Compare it with a basic inverted index search (so just searching with text):
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
    .with_where(where_filter_2)\
    .do()
Measured with the tqdm module, this gives me about 5 iters/sec with 38k objects in the DB, and the post-filtering approach gives me the same performance, at 5 iters/sec.
Am I wrong to find this weird? I was expecting performance closer to pure vector search:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
    .with_near_vector(nearVector)\
    .do()
Which is close to 60 iters/sec (The flat search cut-off is set to 60k, so only brute-force search is used here)
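For anyone who wants to reproduce the comparison without tqdm, a loop along these lines should give a comparable queries-per-second figure. This is only a sketch: `measure_rate` and `fake_query` are hypothetical names, and `fake_query` is a stand-in that would be replaced by one of the client.query.get(...) calls above.

```python
import time

def measure_rate(run_query, iterations=50):
    """Call run_query repeatedly and return the achieved calls per second."""
    start = time.perf_counter()
    for _ in range(iterations):
        run_query()
    elapsed = time.perf_counter() - start
    return iterations / elapsed if elapsed > 0 else float("inf")

calls = []

def fake_query():
    """Stand-in for a real query; records the call and does a little work."""
    calls.append(1)
    sum(range(1000))

rate = measure_rate(fake_query, iterations=20)
print(f"{rate:.1f} iters/sec")
```

Running the same helper against each of the three query variants keeps the measurement method identical, so any difference should come from the queries themselves.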
Is the 'Where' filter applied only on the results supplied by the vector search? If so, shouldn't it be much faster? The filter would only be applied to 100 objects at most since that is the default number of results of vector search.
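To make that expectation concrete, this is the post-filtering behaviour I have in mind, again as a toy sketch in plain Python. It describes my assumption, not what Weaviate is confirmed to do: `postfiltered_search`, `mentions_paris`, and the sample data are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def postfiltered_search(objects, query, predicate, limit=100):
    """Vector search first, then filter only the returned candidates."""
    # Step 1: brute-force search over every object, keep the top `limit` ids.
    ranked = sorted(
        objects,
        key=lambda i: cosine(objects[i]["vector"], query),
        reverse=True,
    )
    candidates = ranked[:limit]
    # Step 2: the filter only ever sees `limit` objects, not the whole DB.
    return [i for i in candidates if predicate(objects[i])]

def mentions_paris(obj):
    """Hypothetical stand-in for the text filter on image_text."""
    return "Paris" in obj["image_text"]

# Hypothetical toy data.
objects = {
    1: {"image_text": "Paris France", "vector": [1.0, 0.0]},
    2: {"image_text": "Berlin Germany", "vector": [0.9, 0.1]},
    3: {"image_text": "Paris France", "vector": [0.0, 1.0]},
}
hits = postfiltered_search(objects, [1.0, 0.0], mentions_paris, limit=2)
print(hits)
```

Under this assumption the filter cost is bounded by the result limit, which is why I expected timings close to pure vector search. It also means a matching object outside the top results (object 3 here) is silently dropped.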
This is kind of confusing. Am I wrong in my understanding of how search works? Thanks for reading my question!