How to correctly index where conditions of a query?

Question

I have the following pojo which maps a db row entry:

public class Pojo{
  //key
  private String a;
  private String b;
  private String c;

  //other columns
  private String d;
  private String e;
  private String f;

  //defining attributes on each field with capital letter (ex: a->A)
}

I create the following collection:

IndexedCollection<Pojo> cq = new ConcurrentIndexedCollection<Pojo>();
//...loading data in collection from DB...
cq.addIndex(NavigableIndex.onAttribute(Pojo.A)); //part of key in DB
cq.addIndex(NavigableIndex.onAttribute(Pojo.F)); //not part of key in DB

Finally I measure the performances of the following query against a 200k elements taken from the db (all the table):

Query<Pojo> query1 = and(equal(Pojo.A, par1),
                equal(Pojo.F, par2));

Equivalent, of course, to:

select* where A=? and F=?

but it seems my index strategy, in which I define an index for each parameter of the query, is lacking something as my query just accellerates the processing by mere 7ms comparing to direct DB access. Having all the table in the memory I would expect some better performances...what am I doing wrong?

score 4 · Accepted Answer · answered Jan 18 '17 at 22:25

I'm the author of CQEngine so I hope this helps. You are probably encountering an excessive amount of filtering, due to how the indexes are configured.

Say you have a collection of Car objects, each of which has a COLOR attribute and a MANUFACTURER attribute.

If you add an index on COLOR, and a separate index on MANUFACTURER, then CQEngine will be able to retrieve the set of 'blue' cars quickly, or it will be able to retrieve the set of 'Ford' cars quickly. (..so far so good..)

However if you try to retrieve cars which are 'blue' AND which are manufactured by 'Ford' (that is, a complex and() query), then you are not looking for the set of 'blue' cars or the set of 'Ford' cars anymore - you require the intersection of the sets.

So in this scenario, CQEngine will find there is no single index which can return the intersection. The indexes are suboptimal.

Evaluating queries with suboptimal indexes

To answer the query, CQEngine will use statistics from the two available indexes to determine which of the two subqueries matches the fewest cars. That is, which set is smaller: the set of 'blue' cars, or the set of 'Ford' cars?

Let's say there was 1 million Cars in the collection. Of those, let's say 100K Cars were blue, and 90K Cars were manufactured by Ford.

CQEngine would answer the query by retrieving the 90K 'Ford' cars from the index on MANUFACTURER, and filtering each of the 90K cars to determine if it was also 'blue'.

It's quite possible that only 5K Cars in the collection were both 'blue' AND manufactured by 'Ford'. But as the indexes are not optimal to answer requests like that, 90K cars would be scanned and filtered.

Note: I've simplified this example, because in practice most of the filtering will be lazy, and avoided, as the application is unlikely to request and then iterate through thousands of blue Ford cars in a single request.

Avoiding filtering

If you need to reduce the latency of your queries, you need to consider ways to avoid the filtering which may occur as described above.

So in this case, you could considering adding a single CompoundIndex on A and F, instead of the two separate indexes.

How to correctly index where conditions of a query?

1 Answers1