4

I have a large database (50 million rows) containing time series data. There is a clustered index on the [datetime] column, which ensures that the table is always sorted in chronological order.

What is the most performant way to read the rows of the table out into a C# app, on a row-by-row basis?

John Saunders
Contango
  • 3
    "There is a clustered index on the [datetime] column which ensures that that the table is always sorted" - it ensures that asking for the rows in that same order is a *relatively* cheap operation. It does not make any guarantees about the actual order in which rows are stored, nor on the order in which rows are retrieved, absent an `ORDER BY` clause. – Damien_The_Unbeliever Jun 24 '11 at 14:25
  • 3
    Also, are you sure consuming 50 million rows in a C# application, one row at a time, is the best way to process this data? – Damien_The_Unbeliever Jun 24 '11 at 14:26
  • 1
    Also, would it be difficult for you to try this out and see how it performs? Use just a simple query like `SELECT * FROM MyTable ORDER BY clustered_index_column` – John Saunders Jun 24 '11 at 15:13
  • @John Saunders: MyTable now has 300 million rows and is 50 GB. The only way to stream this data into my app is to read it out slowly in rows or chunks. – Contango Jun 24 '11 at 15:37
  • 1
    @Gravitas: 50 million or 300 million? – John Saunders Jun 24 '11 at 17:58
  • @John Saunders It was 50 million when I posted; it's now 330 million. The upload is very quick: I have it running at 100k rows per second on commodity hardware, despite the fact that it has a clustered index on the [datetime] column. – Contango Jun 28 '11 at 15:17
  • @Gravitas: see my answer below. The first row came immediately. – John Saunders Jun 28 '11 at 15:18

3 Answers

2

I would use a SqlDataReader, as it streams its results. You'll still have to specify the ordering, but if the ORDER BY matches the clustered index it should be a (relatively) cheap operation.

using (var db = new SqlConnection(connStr)) {
    db.Open(); // the connection must be open before ExecuteReader is called

    using (var cmd = new SqlCommand(someQuery, db))
    using (var rs = cmd.ExecuteReader()) {
        while (rs.Read()) {
            // do interesting things, one streamed row at a time!
        }
    }
}
Yuck
  • This would work for small databases, but not for large ones. The problem is that if I first did a select on the entire table, it would take 30 minutes to complete, at which point I could begin streaming the results back into my app. Is there a quicker way? – Contango Jun 24 '11 at 14:14
  • 1
    @Gravitas: This will put the least amount of hurt on the server. An alternative (in theory anyway) would be to fill up a `DataSet` with the information you need to process. That *can* be a quicker operation, but given the size of your table I think in practice it would choke and die. – Yuck Jun 24 '11 at 14:14
  • Perhaps use a cursor to populate a temporary table on the server, then transfer the temporary table using SqlDataReader? – Contango Jun 24 '11 at 14:16
  • 1
    @Gravitas: Oh, gross. I forgot your ordering clause will require the whole set to be evaluated before results can be returned. – Yuck Jun 24 '11 at 14:16
  • 1
    @Yuck: are you sure? I would think that the clustered index would be taken into account and, without any aggregates in the query, the rows could start to be returned at once. – John Saunders Jun 24 '11 at 15:12
  • 1
    @John: Yes, the rows will *most likely* be returned in clustered index order. Of course the one thing to always keep in mind is that SQL does not guarantee ordering unless you explicitly specify it. That means the optimizer/db engine can return data however it believes is most efficient. In almost every case that will mean using the clustered index ordering, but it *could* be some other order. – Yuck Jun 24 '11 at 15:19
  • 1
    @Yuck: I wasn't so much referring to the order, as to the suggestion that the entire set would need to be sorted before the first row could be returned. I would think that, because of the index, rows could be returned more quickly than that. – John Saunders Jun 24 '11 at 15:24
  • Even if the rows started to be returned immediately, the entire query would continue to execute in the background, which would be sub-optimal if you only wanted to retrieve the first 1% of 50 million rows. Might need a cursor or a cursor analogue. – Contango Jun 24 '11 at 15:33
  • 1
    @Gravitas: if you only want the first 1%, then use a `TOP` clause in the query. – John Saunders Jun 27 '11 at 20:23
  • 1
    You can always exit out of the read loop on a DataReader, so once you've read the 1%, just stop looping, right? – Brady Moritz Oct 30 '11 at 20:44
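
Putting the last two comments together, a minimal sketch of the idea (reusing the HugeTable schema from the answer below; the connection string is a placeholder): `TOP` caps how much the server will produce in the first place, and the client can still exit the read loop even earlier.

using System.Data.SqlClient;

internal class PartialRead
{
    private static void Main()
    {
        // Placeholder connection string; point it at your own server.
        const string connStr =
            @"Data Source=.\sql2k8r2;Initial Catalog=HugeDatabase;Integrated Security=True";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(
            // TOP bounds the server-side work up front...
            "SELECT TOP 500000 * FROM HugeTable ORDER BY PointInTime", conn))
        {
            conn.Open();

            using (var reader = cmd.ExecuteReader())
            {
                int rows = 0;
                while (reader.Read())
                {
                    // ...and the client can still stop even earlier.
                    if (++rows >= 100000) break;
                }
            }
        }
    }
}
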
2

You should try this and find out. I just did, and saw no performance issues.

USE [master]
GO
/****** Object:  Database [HugeDatabase]    Script Date: 06/27/2011 13:27:50 ******/
CREATE DATABASE [HugeDatabase] ON  PRIMARY 
( NAME = N'HugeDatabase', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL10_50.SQL2K8R2\MSSQL\DATA\HugeDatabase.mdf' , SIZE = 1940736KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB )
 LOG ON 
( NAME = N'HugeDatabase_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL10_50.SQL2K8R2\MSSQL\DATA\HugeDatabase_log.LDF' , SIZE = 395392KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)
GO

USE [HugeDatabase]
GO
/****** Object:  Table [dbo].[HugeTable]    Script Date: 06/27/2011 13:27:53 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[HugeTable](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [PointInTime] [datetime] NULL,
PRIMARY KEY NONCLUSTERED 
(
    [ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE CLUSTERED INDEX [IX_HugeTable_PointInTime] ON [dbo].[HugeTable] 
(
    [PointInTime] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

Populate:

DECLARE @t datetime
SET @t = '2011-01-01'

DECLARE @i int
SET @i=0

SET NOCOUNT ON

WHILE (@i < 50000000)
BEGIN
    INSERT INTO HugeTable(PointInTime) VALUES(@t)
    SET @t = DATEADD(ss, 1, @t)

    SET @i = @i + 1
END

Test:

using System;
using System.Data.SqlClient;
using System.Diagnostics;

namespace ConsoleApplication1
{
    internal class Program
    {
        private static void Main()
        {
            TimeSpan firstRead = new TimeSpan();
            TimeSpan readerOpen = new TimeSpan();
            TimeSpan commandOpen = new TimeSpan();
            TimeSpan connectionOpen = new TimeSpan();
            TimeSpan secondRead = new TimeSpan();

            try
            {

                Stopwatch sw1 = new Stopwatch();
                sw1.Start();
                using (
                    var conn =
                        new SqlConnection(
                            @"Data Source=.\sql2k8r2;Initial Catalog=HugeDatabase;Integrated Security=True"))
                {
                    conn.Open(); connectionOpen = sw1.Elapsed;

                    using (var cmd = new SqlCommand(
                        "SELECT * FROM HugeTable ORDER BY PointInTime", conn))
                    {
                        commandOpen = sw1.Elapsed;

                        var reader = cmd.ExecuteReader(); readerOpen = sw1.Elapsed;

                        reader.Read(); firstRead = sw1.Elapsed;
                        reader.Read(); secondRead = sw1.Elapsed;
                    }
                }
                sw1.Stop();
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
            }
            finally
            {

                Console.WriteLine(
                    "Connection: {0}, command: {1}, reader: {2}, read: {3}, second read: {4}",
                    connectionOpen,
                    commandOpen - connectionOpen,
                    readerOpen - commandOpen,
                    firstRead - readerOpen,
                    secondRead - firstRead);

                Console.Write("Enter to exit: ");
                Console.ReadLine();
            }
        }
    }
}
John Saunders
  • Wow, really nice answer - I'm running the example now. – Contango Jun 28 '11 at 15:30
  • This works really nicely. However, unless you use "TOP 1000000" to limit the number of rows returned by the select statement, there is a 30-second gap if the using() statement is exited early (i.e. if the user presses "cancel"). – Contango Jun 28 '11 at 15:49
  • 1
    @Gravitas: yeah, on my machine, the Dispose of the SqlDataReader timed out. That's why there's no `using` block around it. – John Saunders Jun 28 '11 at 16:19
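
A possible workaround for that gap, sketched here rather than taken from the answer above: call `SqlCommand.Cancel()` before the reader is disposed, so the remaining result set is abandoned instead of drained. The `ReadSome` helper below is hypothetical and assumes an already-open connection to the test database.

using System.Data.SqlClient;

internal static class EarlyExit
{
    // Hypothetical helper; assumes conn is already open.
    public static void ReadSome(SqlConnection conn, int rowsWanted)
    {
        using (var cmd = new SqlCommand(
            "SELECT * FROM HugeTable ORDER BY PointInTime", conn))
        using (var reader = cmd.ExecuteReader())
        {
            int rows = 0;
            while (reader.Read() && ++rows < rowsWanted)
            {
                // process the row here
            }

            // Cancelling the command before the reader is disposed abandons
            // the remaining result set; without this, Dispose/Close drains
            // the rest of the rows, which is where the long pause comes from.
            cmd.Cancel();
        }
    }
}
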
0

See Row Offset in SQL Server, which covers ways to read a large result set out in pages or chunks rather than all at once.
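
In that spirit, a hedged sketch of reading the table out in bounded chunks by keyset-paging on the clustered column; the `ReadInChunks` name is made up, and it assumes the `HugeTable` schema from the previous answer with unique `PointInTime` values.

using System;
using System.Data;
using System.Data.SqlClient;

internal static class ChunkedReader
{
    // Reads HugeTable in batches, keyset-paging on the clustered
    // PointInTime column: remember the last key seen and ask for the
    // next batch above it, so no single query touches the whole table.
    // Assumes PointInTime values are unique (as in the test data above);
    // with duplicates you would need a tie-breaker such as the ID column.
    public static void ReadInChunks(string connStr, int batchSize)
    {
        DateTime last = new DateTime(1753, 1, 1); // minimum SQL datetime

        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();

            while (true)
            {
                int rows = 0;

                using (var cmd = new SqlCommand(
                    "SELECT TOP (@n) ID, PointInTime FROM HugeTable " +
                    "WHERE PointInTime > @last ORDER BY PointInTime", conn))
                {
                    cmd.Parameters.Add("@n", SqlDbType.Int).Value = batchSize;
                    cmd.Parameters.Add("@last", SqlDbType.DateTime).Value = last;

                    using (var reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            last = reader.GetDateTime(1); // track the paging key
                            rows++;
                            // process the row here
                        }
                    }
                }

                if (rows < batchSize) break; // final, short batch: done
            }
        }
    }
}
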

Contango