
When a query is issued, Vespa runs the query on all the content nodes (in the distribution group) and returns the results. I have two keys that are always present in the search query. Can I partition the data by the values of those keys, so that whenever I query, Vespa knows where to look instead of querying all the content nodes?

Harsh Choudhary

2 Answers


In most cases you don't want to use attributes of the data to control distribution to content nodes, because it will lead to uneven load on the nodes. Uneven load means increased cost (your cost is effectively what it would be if every node were as loaded as your most loaded node), and potentially operational problems (adding nodes will not mitigate overload if you keep adding certain documents to the same node).

Usually the savings you might get from doing this are modest, as it is very cheap for a content node to determine that it has no matches, and they are far from offsetting the disadvantages mentioned above. Exceptions might be queries that are very large (tensors with thousands of elements) or a very large fan-out (hundreds of nodes) without this.

Jon

This is not supported for the general use case you describe. The closest feature you might want to look into is streaming mode, e.g. using a hash of the two keys as the document group ID.

Details on streaming mode: https://docs.vespa.ai/documentation/streaming-search.html

Information on document ID schemes: https://docs.vespa.ai/documentation/documents.html
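As a rough sketch of that suggestion (not an official recipe): the two keys could be hashed into a group name that goes into the document ID using the g= scheme described on the documents page, and the same group name is then passed with the query in streaming mode so only that group is visited. The namespace ("mynamespace"), document type ("mydoc"), and key names below are placeholder assumptions, and the streaming.groupname request parameter should be verified against the streaming-search documentation linked above.

    import hashlib

    # Placeholder names: "mynamespace", "mydoc" and the two keys are
    # illustrative only, not taken from the question.

    def group_name(key_a: str, key_b: str) -> str:
        """Derive a stable group name from the two always-present keys."""
        digest = hashlib.sha1(f"{key_a}|{key_b}".encode("utf-8")).hexdigest()
        return digest[:16]  # shortened for readability

    def document_id(key_a: str, key_b: str, local_id: str) -> str:
        """Build a grouped document ID: id:<namespace>:<doctype>:g=<group>:<local-id>."""
        return f"id:mynamespace:mydoc:g={group_name(key_a, key_b)}:{local_id}"

    # Feed side: every document with the same two key values lands in the same group.
    doc_id = document_id("customer-42", "region-eu", "item-1001")

    # Query side (streaming mode): pass the same group so only its documents are
    # visited, e.g. via the streaming.groupname request parameter.
    query_params = {
        "yql": "select * from sources * where userQuery()",
        "query": "some search terms",
        "streaming.groupname": group_name("customer-42", "region-eu"),
    }

Note that streaming mode scans the documents in the selected group at query time rather than using index structures, so it fits best when each group stays reasonably small.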

Another option, but probably not a good fit for your use-case, is to separate the documents into different document types.

  • Yes, that's what we are thinking too. When we index, we will put the documents into different document types according to the key values. When we query, we know which document type to query beforehand, and that can reduce the search set. Is there a limit to how many document types can co-exist? – Harsh Choudhary Jun 18 '20 at 18:42
  • There is no limit as such, but there is an overhead per document type. So if you were to go this route you would want to limit the number of document types, e.g. by using a subset of the bits in the hash mentioned above. By using different document types you will maintain an equal distribution of documents over the content nodes. Still, this will add complexity, and it is a very valid question whether it will lead to improved performance. – Yngve Aasheim Jun 29 '20 at 05:57
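To make the last comment concrete, here is a minimal sketch, under my own assumptions, of deriving a small fixed set of document types from a few bits of the same hash and restricting the query to the matching type (e.g. with the restrict request parameter). The schema names partitioned_0 … partitioned_15 and the cap of 16 types are illustrative choices, not from the thread.

    import hashlib

    NUM_TYPES = 16  # keep this small; each document type adds per-type overhead

    def doc_type_for(key_a: str, key_b: str) -> str:
        """Map the two keys onto one of NUM_TYPES placeholder document types."""
        digest = hashlib.sha1(f"{key_a}|{key_b}".encode("utf-8")).digest()
        bucket = digest[0] % NUM_TYPES  # use only a few bits of the hash
        return f"partitioned_{bucket}"

    # Feed side: pick the document type (and hence schema) per document.
    doc_type = doc_type_for("customer-42", "region-eu")

    # Query side: search only that document type so the others are skipped,
    # e.g. via the restrict request parameter.
    query_params = {
        "yql": "select * from sources * where userQuery()",
        "query": "some search terms",
        "restrict": doc_type_for("customer-42", "region-eu"),
    }

Because the hash-to-type mapping is uniform, documents still spread evenly over the content nodes, which is the property the comment highlights; whether the per-type overhead is worth the smaller search set is something to measure.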