
We have a table logging data. It is logging at say 15K rows per second.

Question: How would we limit the table size to the 1bn newest rows?

i.e. once 1bn rows are reached, the table becomes a ring buffer, deleting the oldest row whenever a new one is added.

Triggers might load the system too much. Here's a trigger example on SO. We are already using a bunch of tweaks to keep the speed up (such as stored procedures, table-valued parameters, etc.).

Edit (8 years on):

My recent question/answer here addresses a similar issue using a time series database.

Nick T
  • After some consideration, we re-examined the customer's question and concluded that despite asking for a ring buffer, what the customer really wanted was a rolling record of the last billion rows. This can be implemented with a cron job that periodically wakes up, does a count then deletes the oldest count - 1bn rows. There are drawbacks, the main one being that it might load the server to an extent that some data is not captured while the cron job is running. So it will need some experimentation with frequency of execution. – Nick T Apr 03 '14 at 07:55
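
A minimal sketch of that periodic trim job (the table and column names dbo.LogData and Id are illustrative assumptions; Id is taken to be an ever-increasing clustered key, so the smallest values are the oldest rows):

DECLARE @excess bigint =
    (SELECT COUNT_BIG(*) FROM dbo.LogData) - 1000000000

WHILE @excess > 0
BEGIN
    -- Delete the oldest rows in small batches so the job does not hold
    -- long locks and starve the concurrent inserts
    DELETE FROM dbo.LogData
    WHERE Id IN (SELECT TOP (50000) Id FROM dbo.LogData ORDER BY Id)

    SET @excess = @excess - 50000
END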

5 Answers


Unless there is something magic about 1 billion, I think you should consider other approaches.

The first that comes to mind is partitioning the data. Say, put one hour's worth of data into each partition. At 15,000 rows per second that is about 15,000*60*60 = 54 million records per partition, so roughly 19 hourly partitions hold a billion rows. Once you reach that point, you can drop the oldest partition every hour.

One big advantage of partitioning is that insert performance should stay good and you don't have to delete individual records. There can be additional overhead depending on the query load, indexes, and other factors. But with no additional indexes and a load that is primarily inserts, it should solve your problem better than trying to delete 15,000 records each second on top of the inserts.
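
A rough sketch of the sliding-window setup, assuming a LogTime column is used as the partitioning key (the names pfLogHour, psLogHour and dbo.LogData are illustrative, and the single-statement partition truncation needs SQL Server 2016 or later; on older versions you would SWITCH the old partition out to a staging table and truncate that instead):

-- One boundary per hour; in practice new boundaries would be split ahead of
-- time by a maintenance job rather than listed by hand
CREATE PARTITION FUNCTION pfLogHour (datetime2(0))
AS RANGE RIGHT FOR VALUES ('2014-03-28 00:00', '2014-03-28 01:00', '2014-03-28 02:00')

CREATE PARTITION SCHEME psLogHour
AS PARTITION pfLogHour ALL TO ([PRIMARY])

CREATE TABLE dbo.LogData (
    LogTime datetime2(0) NOT NULL
   ,Message varchar(100) NOT NULL
) ON psLogHour (LogTime)

-- Hourly cleanup once ~19 partitions (~1bn rows) exist: empty the oldest
-- partition, then merge away its boundary
TRUNCATE TABLE dbo.LogData WITH (PARTITIONS (1))
ALTER PARTITION FUNCTION pfLogHour() MERGE RANGE ('2014-03-28 00:00')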

Gordon Linoff
  • Thank you for the answer. We looked at partitioning for another reason before; one issue is that you need the Enterprise edition of SQL Server (http://technet.microsoft.com/en-us/library/cc645993.aspx). Plus it is quite a complex thing to configure. But a good suggestion if your system is at that level of sophistication. – Nick T Mar 28 '14 at 12:01
  • @NickT . . . If you are doing 15k inserts per second, you are at that level of sophistication. – Gordon Linoff Mar 28 '14 at 12:10

I don't have a complete answer but hopefully some ideas to help you get started.

I would add some sort of numeric column to the table. This value would increment by 1 until it reached the number of rows you want to keep. At that point the procedure would switch to update statements, overwriting the existing rows instead of inserting new ones. You obviously won't be able to use this column to determine the order of the rows, so if you don't have one already I would also add a timestamp column so you can order them chronologically later.

In order to coordinate the counter across concurrent transactions you could use a sequence, then take its value modulo the number of rows you keep to get the counter value.

In order to handle any gaps in the table (e.g. someone deleted some of the rows) you may want to use a merge statement. This should perform an insert if the row is missing or an update if it exists.
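
A minimal sketch of the idea under those assumptions (the names dbo.LogRing, LogSlotSequence, Slot, LoggedAt and Message are illustrative, and the ring size is the 1bn from the question):

CREATE TABLE dbo.LogRing (
    Slot bigint NOT NULL            -- 0 .. 999,999,999, reused cyclically
   ,LoggedAt datetime2(3) NOT NULL  -- real chronological order of the rows
   ,Message varchar(100) NOT NULL
   ,CONSTRAINT [PK_LogRing] PRIMARY KEY CLUSTERED ([Slot] ASC)
)

CREATE SEQUENCE LogSlotSequence AS bigint START WITH 0 INCREMENT BY 1

-- Inside the logging procedure: take the next value, reduce it modulo the
-- ring size, then insert the row if the slot is empty or overwrite it if not
DECLARE @seq bigint = NEXT VALUE FOR LogSlotSequence
DECLARE @slot bigint = @seq % 1000000000

MERGE dbo.LogRing AS target
USING (SELECT @slot AS Slot) AS source
ON target.Slot = source.Slot
WHEN MATCHED THEN
    UPDATE SET LoggedAt = SYSUTCDATETIME(), Message = 'SomeMessage'
WHEN NOT MATCHED THEN
    INSERT (Slot, LoggedAt, Message)
    VALUES (source.Slot, SYSUTCDATETIME(), 'SomeMessage');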

Hope this helps.

Mark Wagoner

Here's my suggestion:

  • Pre-populate the table with 1,000,000,000 rows, including a row number as the primary key.
  • Instead of inserting new rows, have the logger keep a counter variable that increments each time, and update the appropriate row according to the row number.

This is actually what you would do with a ring buffer in other contexts. You wouldn't keep allocating memory and deleting; you'd just overwrite the same array over and over.
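
A small sketch of that, with illustrative names (dbo.LogBuffer, RowNo, LoggedAt, Message) and a tiny tally shown in place of the full billion-row pre-population:

CREATE TABLE dbo.LogBuffer (
    RowNo bigint NOT NULL
   ,LoggedAt datetime2(3) NULL
   ,Message varchar(100) NULL
   ,CONSTRAINT [PK_LogBuffer] PRIMARY KEY CLUSTERED ([RowNo] ASC)
)

-- Pre-populate the row numbers (0..999 here; cross join more copies of the
-- digits CTE, or use a proper numbers table, to reach 1,000,000,000 rows)
;WITH digits AS (SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS d(n))
INSERT INTO dbo.LogBuffer (RowNo)
SELECT d1.n + 10*d2.n + 100*d3.n
FROM digits AS d1 CROSS JOIN digits AS d2 CROSS JOIN digits AS d3

-- In the logger: @counter is kept by the application and increments on every write
DECLARE @counter bigint = 0
UPDATE dbo.LogBuffer
SET LoggedAt = SYSUTCDATETIME(), Message = 'SomeMessage'
WHERE RowNo = @counter % 1000000000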

Update: the update doesn't actually change the data in place, as I thought it did. So this may not be efficient.

Community
  • Indeed, in fact we may well move this aspect over to a pre-populated binary file in future and dispense with SQL Server altogether. Saves a fortune in licensing fees too. – Nick T Mar 28 '14 at 12:02

Just an idea that is too complicated to write in a comment.

Create a few log tables, three as an example: Log1, Log2, Log3

CREATE TABLE Log1 (
    Id int NOT NULL
        CHECK (Id BETWEEN 0 AND 9)
   ,Message varchar(10) NOT NULL
   ,CONSTRAINT [PK_Log1] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log2 (
    Id int NOT NULL
        CHECK (Id BETWEEN 10 AND 19)
   ,Message varchar(10) NOT NULL
   ,CONSTRAINT [PK_Log2] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log3 (
    Id int NOT NULL
        CHECK (Id BETWEEN 20 AND 29)
   ,Message varchar(10) NOT NULL
   ,CONSTRAINT [PK_Log3] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)

Then create a partitioned view

CREATE VIEW LogView AS (
    SELECT * FROM Log1
  UNION ALL
    SELECT * FROM Log2
  UNION ALL
    SELECT * FROM Log3
)

If you are on SQL2012 you can use a sequence

CREATE SEQUENCE LogSequence AS int 
    START WITH 0
    INCREMENT BY 1
    MINVALUE 0
    MAXVALUE 29
    CYCLE
;

And then start to insert values

INSERT INTO LogView (Id, Message)
SELECT NEXT VALUE FOR LogSequence
      ,'SomeMessage'

Now you just have to truncate the log tables on some kind of schedule
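
For example, a scheduled job along these lines could do it (a sketch, assuming the Log1..Log3 and LogSequence objects above; it truncates the table whose Id range the cycling sequence will re-enter next, so the wrap-around inserts do not collide with old rows on the primary key):

DECLARE @current int = CONVERT(int, (SELECT current_value
                                     FROM sys.sequences
                                     WHERE name = 'LogSequence'))

IF @current BETWEEN 0 AND 9         -- currently filling Log1, Log2 is reused next
    TRUNCATE TABLE Log2
ELSE IF @current BETWEEN 10 AND 19  -- currently filling Log2, Log3 is reused next
    TRUNCATE TABLE Log3
ELSE                                -- currently filling Log3, Log1 is reused next
    TRUNCATE TABLE Log1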

If you don't have SQL Server 2012 you need to generate the cycling Id values some other way

adrianm

I'm looking for something similar myself (using a table as a circular buffer), but it seems like a simpler approach (for me) will be just to periodically delete old entries (e.g. the lowest IDs, the lowest created/last-modified datetimes, or entries over a certain age). It's not a circular buffer, but perhaps it is a close enough approximation for some. ;)
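
For the age-based variant, that can be as small as a scheduled statement like this (a sketch; the CreatedAt column name and the 30-day retention are illustrative assumptions):

-- Run from a SQL Agent job (or similar scheduler); TOP keeps each run's
-- transaction small when the job runs frequently enough
DELETE TOP (50000) FROM dbo.LogData
WHERE CreatedAt < DATEADD(DAY, -30, SYSUTCDATETIME())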

Zeek2
  • I more recently solved this with a Docker Redis container and the lpush and rpop commands https://stackoverflow.com/questions/67301385/how-to-cache-individual-django-rest-api-posts-for-bulk-create/67435451#67435451 – Nick T Aug 10 '21 at 08:55