I have a table with 50 million records, but a compact structure (id, int1, int1, int2, int3). All relevant indexes are implemented.

I need to query it about 50 times for each user interaction. This is taking about 5 seconds, using regular prepared queries against the db server. All the queries are simple selects.

My question is: what can I do to speed this up, even at the cost of significantly more memory? The Locate method for queries is not flexible enough, and using filters directly in the query is slow.

The main query I am running is

select knowledge_id, knowledge_family_id, tag_level, tag_order, 
  total_title_direct_words, total_title_parenthesis_words from knowledge_tags 
  where dictionary_word_id = XX order by tag_level, tag_order

Can anyone suggest a strategy? Would TVirtualTable increase speed?

Toby Allen
Miguel E
  • To answer this we need more information about the nature of these 50 queries. Are they all SELECTs for individual rows? Could you group some of the queries together into larger, more complex queries? – Matt Allwood Aug 02 '13 at 10:35
  • Sorry. All queries are simple selects. With the results of each query I'm building a TDictionary adding the results. I don't think there is a way to group this into less queries. – Miguel E Aug 02 '13 at 10:54
  • `All queries are simple selects` without limitation like `where ID in ` or anything else? – bummi Aug 02 '13 at 11:07
  • @bummi All queries look like (ex.): "select distinct a, b, c, d from tableX where (e = :a) and (d < 40)" – Miguel E Aug 02 '13 at 11:10
  • @Miguel: what does prevent you to write a helper class with your own Locate (a flexible enough one) and use it? – AlexSC Aug 02 '13 at 11:11
  • @AlexSC To work with Locate I would have to load the whole table into memory; that's not a problem. The problem is iterating through the Locate result set (with query.Next): I would have to perform a comparison between each record and my filter values to detect the end of the result set. I believe this is slower, since I have an average of 100 records in the result set. – Miguel E Aug 02 '13 at 11:17
  • If you're using `query.Next` make sure you call `DisableControls`, otherwise it will be slow. If you can cache 'interesting' records it would also help. Iterating over 100 records a few times really shouldn't take 5 s, though. – Matt Allwood Aug 02 '13 at 11:27
  • Did you profile/time this to confirm that the actual query retrievals take up this 5 seconds (I assume yes). Then you would have to look at tuning/optimizing the database or Unidac. – Jan Doggen Aug 02 '13 at 11:46
  • could you show some of these queries?/code? – Pieter B Aug 02 '13 at 12:34
  • There's no answer to this that isn't just wild speculation without more information on the data (and number of rows) and specifics about the queries you're running. "My car sounds funny when I drive it. How can I make the sound stop? It's coming from the front somewhere." isn't enough diagnostic information for a mechanic to prepare a workorder and cost estimate. – Ken White Aug 02 '13 at 12:44
  • There cannot be a valid reason for the way you've modeled your data. The fact you use distinct reveals a lack of normalization. Denormalization is only justified when it is specifically done to improve performance. Pre-existing unnormalized data is simply poor database design. – Sam Aug 03 '13 at 03:26
  • Can you show one of the queries? – Toby Allen Aug 03 '13 at 07:11
  • @TobyAllen This would be the main query: "select knowledge_id, knowledge_family_id, tag_level, tag_order, total_title_direct_words, total_title_parenthesis_words from knowledge_tags where dictionary_word_id = XX order by tag_level, tag_order". – Miguel E Aug 03 '13 at 11:17
  • @PieterB I've posted the main query. – Miguel E Aug 03 '13 at 11:18

2 Answers

Considering that loading the entire table into memory is not a problem, I suggest:

  1. Execute a query to select all the records
  2. Walk the dataset a single time, copying each record into an instance of TObjectList<TMyRecord>, where TMyRecord is a class that holds all the fields of a dataset record; what you end up with is a copy of the dataset in an object list
  3. If possible, sort the list (or select it already ordered) so you can run a binary search when you need it
  4. Free the dataset and work only on the list.

In this scenario you will avoid all the dataset overhead and each search will be much faster.

An example of the TMyRecord is:

interface

type
  TMyRecord = class
  private
    FId: Integer;
    FCol1: Integer;
    FCol2: Integer;
    FCol3: Integer;
    FCol4: Integer;
  public
    constructor Create(aId, aCol1, aCol2, aCol3, aCol4: Integer);
    property Id: Integer read FId write FId;
    property Col1: Integer read FCol1 write FCol1;
    property Col2: Integer read FCol2 write FCol2;
    property Col3: Integer read FCol3 write FCol3;
    property Col4: Integer read FCol4 write FCol4;
  end;

implementation

constructor TMyRecord.Create(aId, aCol1, aCol2, aCol3, aCol4: Integer);
begin
  inherited Create;
  FId := aId;
  FCol1 := aCol1;
  FCol2 := aCol2;
  FCol3 := aCol3;
  FCol4 := aCol4;
end;
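
A sketch of how the list might be filled and then searched once. Query (any TDataSet descendant, e.g. a UniDAC TUniQuery), the field order, and the sort key are assumptions for illustration; it needs System.Generics.Collections and System.Generics.Defaults in the uses clause:

var
  List: TObjectList<TMyRecord>;
begin
  // The list owns the objects and frees them when it is destroyed
  List := TObjectList<TMyRecord>.Create(True);
  Query.Open;
  Query.DisableControls; // avoid UI refresh overhead while iterating
  try
    while not Query.Eof do
    begin
      List.Add(TMyRecord.Create(
        Query.Fields[0].AsInteger,  // id
        Query.Fields[1].AsInteger,
        Query.Fields[2].AsInteger,
        Query.Fields[3].AsInteger,
        Query.Fields[4].AsInteger));
      Query.Next;
    end;
  finally
    Query.EnableControls;
    Query.Close; // from here on, work only on the list
  end;

  // Sort once by the lookup key; each search is then O(log n)
  List.Sort(TComparer<TMyRecord>.Construct(
    function(const L, R: TMyRecord): Integer
    begin
      Result := L.Col1 - R.Col1;
    end));
end;

TList<T>.BinarySearch with the same comparer then locates a record without the per-row comparisons that a Locate/Next loop needs.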
AlexSC
  • @TLama: when working with records you have lots of memory transfers, since records are value types. Classes are reference types, so the data transferred is just a processor word, and often it fits in a CPU register. Considering speed is the main concern here, I have to disagree. – AlexSC Aug 02 '13 at 13:25
  • Records only cause memory transfers if you copy them around. A TList isn't going to cause copies, nor will dealing with const parameters. Classes have extra memory overhead that will add up after 50M records. – afrazier Aug 05 '13 at 00:42
In my opinion it would be much faster to load the entire dataset into a TClientDataSet and use FindKey to seek those records.

To use FindKey you must define indexes like this:

Cds.IndexDefs.Add('IDX1', 'FieldA;FieldB', []);
Cds.IndexName := 'IDX1';
if Cds.FindKey([A, B]) then begin
  // Do something
end;

You can also create multiple indexes and use it whenever you want:

Cds.IndexDefs.Add('IDX1', 'FieldA;FieldB', []);
Cds.IndexDefs.Add('IDX2', 'FieldD;FieldB', []);
Cds.IndexDefs.Add('IDX3', 'FieldA;FieldC', []);

if SeekAB then 
  Cds.IndexName := 'IDX1'
else if SeekDB then 
  Cds.IndexName := 'IDX2'
else 
  Cds.IndexName := 'IDX3';
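
Putting it together, loading the table once and answering the question's main query in memory might look like this. Cds, dspKnowledgeTags and WordId are assumed names, and pulling the rows through a TDataSetProvider is only one of several ways to fill the TClientDataSet:

// Assumes dspKnowledgeTags is a TDataSetProvider attached to the source query
Cds.SetProvider(dspKnowledgeTags);
Cds.Open; // pulls all rows into memory
Cds.IndexDefs.Add('IDX_WORD', 'dictionary_word_id;tag_level;tag_order', []);
Cds.IndexName := 'IDX_WORD';
// Partial key: match on the first index field only
if Cds.FindKey([WordId]) then
  while (not Cds.Eof) and
        (Cds.FieldByName('dictionary_word_id').AsInteger = WordId) do
  begin
    // consume one matching row, already ordered by tag_level, tag_order
    Cds.Next;
  end;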
EProgrammerNotFound
  • I've replaced the repeated execution of the main query with a single query, loading the whole dataset into memory at app startup. To query I'm using Locate and then navigating with Next until finding a non-relevant record. The dataset is ordered in a way that lets me do this. The 5 s have dropped to 0.5 s. I believe this is the way. The only caveat is the startup time, due to the loading of the dataset. Also I must implement a transparent way to add new records to the opened query after new data is added to the queried table. There will be a delay in data availability, but the speed improvement justifies it. – Miguel E Aug 03 '13 at 11:24
  • @MiguelE Have you tried to use: `TADOQuery` > `TADOProvider` > `TClientDataSet` – EProgrammerNotFound Aug 05 '13 at 11:17
  • @MiguelE Also, if your provider supports `SEEK`, this method could be used for queries. It is as fast as `FindKey` with indexes. – EProgrammerNotFound Aug 05 '13 at 11:19
  • I'm using UniDAC from Devart. – Miguel E Aug 05 '13 at 12:04
  • Well, I never used this component. But if it inherits from TADOQuery, you can use `Supports(coSeek)` to check whether the `Seek` method is supported. – EProgrammerNotFound Aug 05 '13 at 12:19
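
For reference, the single-query Locate/Next approach Miguel E describes in the comment above might be sketched like this; Query and WordId are assumed names:

// Done once at startup: one ordered query over the whole table
Query.SQL.Text :=
  'select dictionary_word_id, knowledge_id, knowledge_family_id, tag_level, ' +
  'tag_order, total_title_direct_words, total_title_parenthesis_words ' +
  'from knowledge_tags ' +
  'order by dictionary_word_id, tag_level, tag_order';
Query.Open;

// Per lookup: jump to the first matching row, then walk forward
Query.DisableControls; // no UI refreshes during the scan
try
  if Query.Locate('dictionary_word_id', WordId, []) then
    while (not Query.Eof) and
          (Query.FieldByName('dictionary_word_id').AsInteger = WordId) do
    begin
      // collect the row, e.g. into the TDictionary mentioned in the comments
      Query.Next;
    end;
finally
  Query.EnableControls;
end;

The ordering by dictionary_word_id is what makes "stop at the first non-matching record" correct, which is the caveat Miguel E mentions.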