
I have to implement an algorithm on data which is (for good reasons) stored inside SQL server. The algorithm does not fit SQL very well, so I would like to implement it as a CLR function or procedure. Here's what I want to do:

  • Execute several queries (usually 20-50, but up to 100-200) which all have the form `select a,b,... from some_table order by xyz`. There's an index which fits that query, so the result should be available more or less without any calculation.

  • Consume the results step by step. The exact stepping depends on the results, so it's not exactly predictable.

  • Aggregate some result by stepping over the results. I will only consume the first part of each result, but cannot predict how much I will need. The stop criterion depends on some threshold inside the algorithm.

My idea was to open several SqlDataReader, but I have two problems with that solution:

  • You can have only one SqlDataReader per connection, and inside a CLR method I have only one connection, as far as I understand.

  • I don't know how to tell SqlDataReader to read data in chunks. I could not find documentation on how SqlDataReader is supposed to behave. As far as I understand, it prepares the whole result set and would load the whole result into memory, even if I consumed only a small part of it.

Any hint on how to solve that as a CLR method? Or is there a more low-level interface to SQL Server which is more suitable for my problem?

Update: I should have made two points more explicit:

  1. I'm talking about big data sets, so a query might result in 1 million records, but my algorithm would consume only the first 100-200. But as I said before: I don't know the exact number beforehand.

  2. I'm aware that SQL might not be the best choice for that kind of algorithm. But due to other constraints it has to be a SQL server. So I'm looking for the best possible solution.

Achim
  • I would just write a small C# app if a straight stored procedure will not work. – Jon Raynor Aug 22 '11 at 20:49
  • Not knowing what you are doing exactly, I cringe when you talk about iterating through the data step by step. Have you tried to find a set-based way to handle this? Set-based methods can be quite sophisticated and light years faster than row-by-agonizing-row. That said, yes, there may be some very complex calculations that need to be handled one row at a time; I just haven't run into very many through the years. – HLGEM Aug 22 '11 at 21:38
  • I have actually implemented it using sets, but had to "misuse" them somewhat. I probably would have written your answer to myself, but the set-based code is really awkward and is becoming a nightmare to maintain and to debug. I usually don't like cursor operations, but because I know that I'll traverse an index in sorted order, I think it's the best option. The only alternative might be a DSL which generates SQL code. – Achim Aug 22 '11 at 21:44
  • Have you already seen [Adam Machanic's article on CLR and running totals](http://sqlblog.com/blogs/adam_machanic/archive/2006/07/12/running-sums-yet-again-sqlclr-saves-the-day.aspx)? He hints at a faster method that he hasn't blogged about yet as well, for some reason. – Martin Smith Aug 31 '11 at 21:12
  • The tip with Adam Machanic's article/blog was the most helpful one yet. So if you create an answer with it, I would accept it. – Achim Sep 06 '11 at 08:48

3 Answers


SqlDataReader does not read the whole data set; you are confusing it with the DataSet class. It reads row by row, as the .Read() method is called. If a client does not consume the result set, the server will suspend the query execution because it has no room to write the output (the selected rows) into. Execution resumes as the client consumes more rows (i.e. as SqlDataReader.Read is called). There is even a special command behavior flag, SequentialAccess, that instructs ADO.NET not to pre-load the entire row into memory, useful for accessing large BLOB columns in a streaming fashion (see Download and Upload images from SQL Server via ASP.Net MVC for a practical example).
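To illustrate just the consumption pattern (the early-exit Read() loop, not the server-side streaming itself), here is a minimal, self-contained sketch. The in-memory DataTable is only a stand-in for a real SqlDataReader, and the threshold value is made up; against SQL Server you would obtain the reader via `command.ExecuteReader(CommandBehavior.SequentialAccess)` instead:

```csharp
using System;
using System.Data;

static class ReaderDemo
{
    // Consume rows one by one and stop at a threshold; rows after the
    // stop point are never touched.
    public static int ConsumeUntil(IDataReader reader, int threshold)
    {
        int consumed = 0;
        while (reader.Read())
        {
            int value = reader.GetInt32(0);
            consumed++;
            if (value >= threshold) break; // stop criterion reached
        }
        return consumed;
    }

    static void Main()
    {
        // In-memory stand-in for a large ordered result set.
        var table = new DataTable();
        table.Columns.Add("xyz", typeof(int));
        for (int i = 0; i < 1000; i++) table.Rows.Add(i);

        using (IDataReader reader = table.CreateDataReader())
            Console.WriteLine(ConsumeUntil(reader, 199)); // prints 200
    }
}
```

The point is that nothing forces you to drain the reader; closing it after 200 of 1,000,000 rows is legitimate, and the server stops producing rows once the reader is closed.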

You can have multiple active result sets (multiple SqlDataReaders) on a single connection when MARS is enabled. However, MARS is incompatible with SQLCLR context connections.

So you can create a CLR streaming TVF to do some of what you need in CLR, but only if you have one single SQL query source. Multiple queries would require you to abandon the context connection and instead use a fully fledged connection, i.e. connect back to the same instance in a loopback, and this would allow MARS and thus consuming multiple result sets. But a loopback has its own issues, as it breaks the transaction boundaries you get from the context connection. Specifically, with a loopback connection your TVF won't be able to read the changes made by the same transaction that called the TVF, because it is a different transaction on a different connection.
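A loopback connection along those lines might look like the following sketch. The server name, database name and permission details are assumptions for illustration (deploying this from SQLCLR requires at least EXTERNAL_ACCESS assembly permission), and the transaction caveat described above still applies:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

static class LoopbackSketch
{
    // Sketch only: open a loopback connection (NOT the context connection)
    // with MARS enabled, so several SqlDataReaders can be open at once.
    public static List<SqlDataReader> OpenReaders(IEnumerable<string> queries)
    {
        var conn = new SqlConnection(
            "Data Source=.;Initial Catalog=MyDb;Integrated Security=true;" +
            "MultipleActiveResultSets=true");   // MARS: multiple open readers per connection
        conn.Open();

        var readers = new List<SqlDataReader>();
        foreach (string query in queries)       // the 20-200 ordered queries
            readers.Add(new SqlCommand(query, conn).ExecuteReader());
        return readers;                         // advance whichever reader the algorithm needs next
    }
}
```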

Remus Rusanu
  • Remus: Can't you have the loopback connection join the already in-progress transaction from the context? Or would that still self-block? – RBarryYoung Sep 07 '11 at 03:33
  • @RBarryYoung Technically it can, using `sp_getbindtoken`/`sp_bindsession`. Not recommended at all. – Remus Rusanu Sep 07 '11 at 04:37

SQL is designed to work against huge data sets, and is extremely powerful. With set based logic it's often unnecessary to iterate over the data to perform operations, and there are a number of built-in ways to do this within SQL itself.

1) write set based logic to update the data without cursors

2) use deterministic User Defined Functions with set-based logic (you can do this with the SqlFunction attribute in CLR code). Non-deterministic functions have the effect of internally turning the query into a cursor; non-deterministic means the output value is not always the same given the same input.

[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static int algorithm(int value1, int value2)
{
    int value3 = ... ;
    return value3;
}

3) use cursors as a last resort. This is a powerful way to execute logic per row on the database but has a performance impact. It appears from this article that CLR can outperform SQL cursors (thanks Martin).

I saw your comment that the complexity of using set based logic was too much. Can you provide an example? There are many SQL ways to solve complex problems - CTE, Views, partitioning etc.

Of course you may well be right in your approach, and I don't know what you are trying to do, but my gut says leverage the tools of SQL. Spawning multiple readers isn't the right way to approach the database implementation. It may well be that you need multiple threads calling into a SP to run concurrent processing, but don't do this inside the CLR.

To answer your question: with CLR implementations (and IDataReader) you don't really need to page results in chunks, because you are not loading the data into memory or transporting it over the network. IDataReader gives you access to the data stream row by row. By the sounds of it, your algorithm determines the number of records that need updating, so when that point is reached simply stop calling Read() and end there.

SqlMetaData[] columns = new SqlMetaData[3];
columns[0] = new SqlMetaData("Value1", SqlDbType.Int);
columns[1] = new SqlMetaData("Value2", SqlDbType.Int);
columns[2] = new SqlMetaData("Value3", SqlDbType.Int);

SqlDataRecord record = new SqlDataRecord(columns);
SqlContext.Pipe.SendResultsStart(record);

SqlDataReader reader = comm.ExecuteReader();

bool flag = true;

while (reader.Read() && flag)
{
    int value1 = Convert.ToInt32(reader[0]);
    int value2 = Convert.ToInt32(reader[1]);

    // some algorithm
    int newValue = ...;

    // write the values into the output record (ordinals are zero-based)
    record.SetInt32(0, value1);
    record.SetInt32(1, value2);
    record.SetInt32(2, newValue);

    SqlContext.Pipe.SendResultsRow(record);

    // keep going?
    flag = newValue < 100;
}

SqlContext.Pipe.SendResultsEnd();
TheCodeKing
  • One potential example might be a running total type query. See [my answer here](http://stackoverflow.com/questions/6877507/when-are-tsql-cursors-the-best-or-only-option/6879770#6879770) for a graph showing how cursors rapidly out perform "set based" solutions. It is possible to implement a "faster cursor" using CLR as Adam Machanic discusses [here.](http://sqlblog.com/blogs/adam_machanic/archive/2006/07/12/running-sums-yet-again-sqlclr-saves-the-day.aspx) – Martin Smith Sep 04 '11 at 09:11
  • Interesting links, thanks. Of course it depends on the data/algorithm which is quicker though, right (cursor vs set-based)? The Adam Machanic results look really interesting; I'd love to see some more benchmarking across different datasets. – TheCodeKing Sep 04 '11 at 09:31

Cursors are a SQL-only feature. If you wanted to read chunks of data at a time, some sort of paging would be required so that only a certain number of records is returned. If you are using LINQ, `.Skip(skip)` and `.Take(pageSize)` can be used to limit the results returned.
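As a standalone illustration of that paging pattern (LINQ to Objects here, with made-up data and page size; against the database, a LINQ provider translates Skip/Take into a paged query):

```csharp
using System;
using System.Linq;

static class PagingDemo
{
    // Skip the earlier pages, then take one page worth of records.
    public static int[] Page(int[] source, int pageNumber, int pageSize)
    {
        return source.Skip(pageNumber * pageSize).Take(pageSize).ToArray();
    }

    static void Main()
    {
        int[] data = Enumerable.Range(1, 10).ToArray();
        int[] secondPage = Page(data, 1, 3);            // pages are zero-based
        Console.WriteLine(string.Join(",", secondPage)); // prints 4,5,6
    }
}
```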

You can simply iterate over the DataReader by doing something like this:

using (IDataReader reader = Command.ExecuteReader())
{
    while (reader.Read())
    {
        //Do something with this record
    }
}

This iterates over the results one at a time, similar to a cursor in SQL Server.

For multiple recordsets at once, try MARS (if SQL Server)

http://msdn.microsoft.com/en-us/library/ms131686.aspx

Jon Raynor
  • Loading/generating a result set into memory is not an option. And MARS is not available inside a CLR procedure, as far as I could figure out. Do you have a working example? – Achim Aug 22 '11 at 21:02
  • An IDataReader on an SQL command like "SELECT ... FROM ..." actually opens a read-only, fast-forward cursor. It does **not** load the whole result set in memory. Each record is read when the .Read() method is executed. This is the main difference with something like a DataSet, which does load everything in memory. Of course it could depend on the underlying ADO.NET provider, but with SQL Server, this is how it works. So it's a good solution. And BTW LINQ's Skip() method does exactly that. – Simon Mourier Aug 31 '11 at 20:56
  • @Jon a for loop inside a SQL CLR method will act the same as a cursor. – Kirk Broadhurst Sep 07 '11 at 01:07