Post-filtering in Weaviate

Question

I have a Weaviate instance running (v1.12.2). I am experimenting with the Python client https://weaviate-python-client.readthedocs.io/en/stable/ (v3.4.2): adding, retrieving, and deleting objects, and so on.

I am trying to understand how filtered vector search works, as outlined here: https://weaviate.io/developers/weaviate/current/architecture/prefiltering.html#recall-on-pre-filtered-searches

When applying pre-filtering, an 'allow-list' of object IDs is constructed before the vector search is carried out. The list is built by filtering objects on some property.

For example the Where filter I'm using is:

where_filter_1 = {
  "path": ["user"],
  "operator": "Equal",
  "valueText": "billy"
}

This is because I've got many users whose data are kept in this DB and I would like for each user to be able to search their own data. In this case it is Image data.

This is how I implement this using the python client:

result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
                        .with_where(where_filter_1)\
                        .with_near_vector(nearVector)\
                        .do()

I do not use any vectorization modules, so I create my own vector and pass it to the DB for vector search using .with_near_vector(nearVector) after applying the filter with .with_where(where_filter_1). This works as I expect, so I think I'm doing it correctly.

I'm less sure whether I'm applying post-filtering correctly. Each image has some text attached to it, and I use the Where filter to search that text via the inverted index.

where_filter_2 = {
  "path": ["image_text"],
  "operator": "Like",
  "valueText": "Paris France"
}

I apply post filtering like this:

result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
                        .with_near_vector(nearVector)\
                        .with_where(where_filter_2).do()

However, I don't think I'm doing this properly. A basic inverted index search (just searching with text):

result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
                        .with_where(where_filter_2).do()

gives me about 5 iters/sec (measured with the tqdm module) with 38k objects in the DB.

The post-filtering approach gives me the same performance, about 5 iters/sec.

Am I wrong to find this weird? I was expecting performance closer to pure vector search:

result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
                        .with_near_vector(nearVector).do()

which is close to 60 iters/sec. (The flat search cut-off is set to 60k, so only brute-force search is used here.)

Is the 'Where' filter applied only on the results supplied by the vector search? If so, shouldn't it be much faster? The filter would only be applied to 100 objects at most since that is the default number of results of vector search.

This is kind of confusing. Am I wrong in my understanding of how search works? Thanks for reading my question!

etiennedi · Answer 1 · 2022-05-05T15:14:43.513

Your question seems to imply that you are switching between a pre- and post-filtering approach. But as of v1.13, all filtered vector searches use pre-filtering. There is currently no option for post-filtering. That explains why both your searches have identical results. You are mostly experiencing the cost of building the filter.
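To make the pre-filtering order of operations concrete, here is a toy sketch in plain Python. It is not Weaviate's implementation: the function names, the linear scan that stands in for the inverted index, and the sample objects are all made up for illustration. It only shows that the allow-list is built first, and that the vector search then runs over the allowed objects only, regardless of the order in which .with_where and .with_near_vector are chained.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prefiltered_search(objects, predicate, query_vector, limit=100):
    """Toy pre-filtered search (hypothetical helper, not a Weaviate API)."""
    # Step 1: build the allow-list of object indices that pass the filter.
    # A real engine would consult an inverted index instead of scanning.
    allow_list = [i for i, obj in enumerate(objects) if predicate(obj)]

    # Step 2: brute-force vector search restricted to the allow-list.
    scored = [
        (cosine_similarity(objects[i]["vector"], query_vector), i)
        for i in allow_list
    ]
    scored.sort(reverse=True)
    return [objects[i] for _, i in scored[:limit]]

# Made-up sample data for illustration only.
objects = [
    {"user": "billy", "image_uri": "a.jpg", "vector": [1.0, 0.0]},
    {"user": "billy", "image_uri": "b.jpg", "vector": [0.0, 1.0]},
    {"user": "alice", "image_uri": "c.jpg", "vector": [1.0, 0.0]},
]

hits = prefiltered_search(
    objects, lambda obj: obj["user"] == "billy", [1.0, 0.1], limit=1
)
```

In this sketch, step 1 runs in full on every query, so a filter that is expensive to build dominates the total time even when step 2 is cheap.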

Side-Note 1:

I see that you are using a Like operator. The Like operator only differs from the Equal operator if you are using wildcards. Since you are not using them, you can also use the Equal operator, which tends to be more efficient in many cases. (I'm not sure if that applies to your case, but it tends to be true overall.)
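As a rough illustration of the difference, the two filters below target the same image_text property from the question. The wildcard value "Par*" is an invented example; in Weaviate's Like operator, * stands for zero or more characters and ? for exactly one. Whether Equal is actually faster for your data is something to measure rather than assume.

```python
# Like is only needed when the value contains wildcards.
# "Par*" is a made-up example value, not taken from the question.
where_filter_like = {
    "path": ["image_text"],
    "operator": "Like",
    "valueText": "Par*",
}

# Without wildcards, the same kind of filter can use Equal instead.
where_filter_equal = {
    "path": ["image_text"],
    "operator": "Equal",
    "valueText": "Paris France",
}
```

Either dictionary can be passed to .with_where(...) in the same way as where_filter_2 in the question.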

Side-Note 2:

If you are measuring throughput from a single client thread, i.e. using tqdm from a Python script (without multi-threading), you're not maxing out Weaviate. Since you only start sending the second query once the first has been processed client-side, Weaviate will be idle most of the time. If you are interested in the maximum throughput, you need at least as many client threads as there are cores on the server.
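A minimal sketch of a multi-threaded measurement, using only the standard library. The helper measure_throughput and the stand-in fake_query are hypothetical names; fake_query just sleeps to simulate a query that spends its time waiting on the server, so the numbers it produces say nothing about Weaviate itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(run_query, n_queries, n_threads):
    """Run `run_query` n_queries times across n_threads; return queries/sec."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Each worker sends its next query as soon as its previous one returns.
        results = list(pool.map(lambda _: run_query(), range(n_queries)))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed

def fake_query():
    """Stand-in for a real query: waits 10 ms as if blocked on the server."""
    time.sleep(0.01)
    return {"data": {}}

single = measure_throughput(fake_query, n_queries=40, n_threads=1)
multi = measure_throughput(fake_query, n_queries=40, n_threads=8)
print(f"1 thread: {single:.0f} queries/sec, 8 threads: {multi:.0f} queries/sec")
```

To measure a real instance, replace fake_query with a function that runs the client.query.get(...).do() call from the question. Whether a single client object can be shared safely across threads is an assumption to check for your client version; creating one client per thread avoids the question.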

Thanks for this! It was really helpful. As I suspected ! Are there plans to add post-filtering in the future ? Noted the bit about the 'Like' and 'Equal' operators thanks for that too ! — Billy.G, May 05 '22 at 15:37
