
Short version of the question:

If you have a table with a large number of small rows, and you want to retrieve a single record from this table via an index (probably consisting of two columns), is this likely to be low cost and fast, or high cost and slow?

Longer version of question and background:

I am a consultant working with a software development company, and I am having a disagreement with them about the performance implications of a piece of functionality that I want to add to the application they are building (and I am designing).

At the moment, we write out a log record every time somebody retrieves a client record. I want to put the name and time of the last person to have previously accessed that record onto the client page each time that record is retrieved.

They are saying that the performance implications of this will be high, but based on my reasonable but not expert knowledge of how B-trees work, this doesn't seem right even if the table is very large.

If you create an index on the GUID of the client record and the date/time of access (descending), then you ought to be able to retrieve the required record via an index seek, which would just need to find the first entry for that GUID and then stop. And with a B-tree index, most of the upper levels of the index would be cached, so the number of physical disc accesses needed would be very small and the query time therefore significantly less than 1s.
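To make that concrete, here is a sketch of the sort of index and query I have in mind (SQL Server syntax; the table and index names are made up, and the columns are the ones our log holds):

    -- Illustrative log table; all names are hypothetical
    CREATE TABLE dbo.AccessLog (
        ClientGUID uniqueidentifier NOT NULL,
        UserGUID   uniqueidentifier NOT NULL,
        AccessTime datetime2        NOT NULL,
        AccessType tinyint          NOT NULL
    );

    -- Composite index: client GUID first, then access time descending
    CREATE NONCLUSTERED INDEX IX_AccessLog_Client_Time
        ON dbo.AccessLog (ClientGUID, AccessTime DESC)
        INCLUDE (UserGUID);  -- covers the query, so no extra lookup per row

    -- Last access for one client: a seek to the first matching key, then stop
    DECLARE @client uniqueidentifier = NEWID();  -- placeholder for the client being displayed
    SELECT TOP (1) UserGUID, AccessTime
    FROM dbo.AccessLog
    WHERE ClientGUID = @client
    ORDER BY AccessTime DESC;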

Or have I got this completely wrong?

Richard C

4 Answers


You will have problems with GUID index fragmentation, but because your rows do not increase in size (as you said in the comments) you will not have page-splitting problems. The random-insert issue is fixable by periodically reorganizing and rebuilding the index.
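For example, periodic maintenance along these lines (the object names here are the hypothetical ones from the sketch in the question):

    -- Lightweight defragmentation; can run while the table is in use
    ALTER INDEX IX_AccessLog_Client_Time ON dbo.AccessLog REORGANIZE;

    -- Heavier option; a lower fill factor leaves room for the random GUID inserts
    ALTER INDEX IX_AccessLog_Client_Time ON dbo.AccessLog REBUILD WITH (FILLFACTOR = 80);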

Besides that, there is nothing wrong with your approach. If the table is larger than RAM you will likely have a single disk IO per access (the intermediate index levels will be cached). If your data fits in RAM you will pay about 0.2 to 0.5 ms per query. If your data is on a magnetic disk, a seek will likely require 8-12 ms. On an SSD you are back to 0.2 to 0.5 ms (maybe 0.05 ms more).

Why don't you just create some test data (by selecting a cross product from sys.objects to get 1M rows) and measure it? It takes little time and you will find out for sure.
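A sketch of such a test, reusing the hypothetical names from the question (note that NEWID() gives every row its own client, so reuse a smaller pool of GUIDs if you want many log entries per client):

    -- ~1M rows of fake log data from a cross join on sys.objects
    SELECT TOP (1000000)
           NEWID() AS ClientGUID,
           NEWID() AS UserGUID,
           DATEADD(SECOND,
                   -CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS int),
                   SYSDATETIME()) AS AccessTime,
           CAST(1 AS tinyint) AS AccessType
    INTO   dbo.AccessLogTest
    FROM   sys.objects a CROSS JOIN sys.objects b CROSS JOIN sys.objects c;

    CREATE INDEX IX_Test
        ON dbo.AccessLogTest (ClientGUID, AccessTime DESC)
        INCLUDE (UserGUID);

    -- Time one lookup and count the pages it touches
    DECLARE @g uniqueidentifier = (SELECT TOP (1) ClientGUID FROM dbo.AccessLogTest);
    SET STATISTICS IO, TIME ON;
    SELECT TOP (1) UserGUID, AccessTime
    FROM dbo.AccessLogTest
    WHERE ClientGUID = @g
    ORDER BY AccessTime DESC;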

usr
  • Not sure I understand the comment about increasing row size. Each time a row is written it will be of a fixed size and it will not be updated. And just to understand your second paragraph - if the table doesn't fit in RAM (which it almost certainly won't), then the row required will be retrieved with a single disc access which will take 0.5ms? – Richard C Jan 31 '12 at 23:57
  • We also have a sequentially increasing reference number for each client which we could use instead of the GUID (a sketch of that index follows below). My understanding is that this would remove the problems of index page fragmentation/reorganisation. – Richard C Feb 01 '12 at 19:35
  • It would. I added some advice. – usr Feb 01 '12 at 21:05
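A sketch of that alternative index, assuming the sequential reference is an integer column (ClientRef is a hypothetical name):

    -- int keys are 4 bytes against 16 for a GUID, so more entries fit per page
    CREATE NONCLUSTERED INDEX IX_AccessLog_ClientRef_Time
        ON dbo.AccessLog (ClientRef, AccessTime DESC)
        INCLUDE (UserGUID);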

Should be low cost and fast since the columns are indexed; that would be O(n), I think.

mdprotacio
  • Thanks. Sorry, I don't understand O(n). – Richard C Jan 31 '12 at 22:09
  • If you don't know what big-O notation is, you might want to self-educate on that before you start making assumptions about performance. This might serve as a good introduction: http://stackoverflow.com/questions/3255/big-o-how-do-you-calculate-approximate-it – mikurski Jan 31 '12 at 23:22
  • OK - now I understand Big O (I think). If that is the case, O(n) implies that the search algorithm has a linear relationship between the number of records in the index and the time taken. My understanding is that this is not the case: once the index has reached a certain size, significant growth can occur without a significant increase in search time (provided the pages do not become fragmented) – Richard C Feb 01 '12 at 19:39
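As a rough worked example of that last point (assuming an index page holds on the order of 500 entries): a B-tree lookup reads about ceil(log_500(n)) pages, so 500^2 = 250,000 rows fit in two index levels and 500^3 = 125,000,000 rows fit in three. The index can grow from a quarter of a million rows to over a hundred million before a lookup costs even one extra page read, which is why search cost is logarithmic rather than linear and feels flat in practice.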

You say the last person to access the record? You mean that for every read you will have a write?
And that write is going to change an indexed date/time column?

Then I would be worried too.

Writing on each record read will cause you lots of extra disk writes. This will block reads, and it might be bad for your caching too. You also need to update your index a lot, and since you change the indexed data, your index will become very fragmented.

Albin Sunnanbo
  • Sorry - just to be clear, the log file gets updated every time somebody retrieves a client for the first time and displays the summary page for this client; not every time that client record is read - which would be a much higher rate. And although we are adding records to the log file, they are not being updated. So we are just doing inserts into the index. Point taken that writing the log might have an impact, but we are already doing this. The question is whether reading this file in the way that I describe is going to be slow or quick. Thanks for your reply, though. – Richard C Jan 31 '12 at 22:08

It depends.

A single retrieval will be low cost and fast

  • on a decent indexed table
  • running on decent hardware
  • over a decent network

On the other hand, it takes time nonetheless.

If we are talking about one retrieval per hour, don't sweat over it. If we are talking about thousands of retrievals per second (as opposed to currently none), it will start to add up to the point where it is noticeable.

Some questions you need to address:

  • Is my hardware up to spec?
  • Does adding two fields result in a page split (unlikely)?
  • How many extra pages need to be read for your regular result sets?
  • How many retrievals/sec will be made?
  • How many inserts/sec (each triggering an index update) will be made?

After you've addressed these questions, you should be able to make the determination yourself. As far as my gut feeling goes, I would be surprised if you noticed the performance difference.
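If you want numbers rather than gut feeling, a starting point for the index-related questions might be (object names assumed from the sketch in the question):

    -- Fragmentation and size of the index
    SELECT index_id, avg_fragmentation_in_percent, page_count
    FROM   sys.dm_db_index_physical_stats(
               DB_ID(), OBJECT_ID('dbo.AccessLog'), NULL, NULL, 'LIMITED');

    -- Pages read per retrieval: turn on statistics, run the lookup,
    -- and check the 'logical reads' figure in the Messages output
    SET STATISTICS IO ON;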

Lieven Keersmaekers
  • Thanks - that's helpful. Hardware and network are OK. I reckon we are looking at a retrieval (and a log record being written) once every few seconds. The log file columns are Date/Time, UserGUID, ClientGUID, and AccessType. The software developer is concerned about the impact of this file growing large (e.g. several million rows), but my recollection of B-trees is that they are insensitive to file size – Richard C Feb 01 '12 at 19:29