7

I have several programs written in R that I now need to translate into T-SQL to deliver them to the client. I am new to T-SQL and I'm having some difficulty translating all my R functions.

An example is the numerical derivative function which, for two input columns (values and time), would return another column (of the same length) with the computed derivative.

My current understanding is:

  1. I can't use an SP, because I'll need to use these functions inline in SELECT statements, like: SELECT Customer_ID, Date, Amount, derivative(Amount, Date) FROM Customer_Detail

  2. I can't use UDFs, because they can take only scalar values as input parameters. I'll need vectorised functions, both for speed and because for some of my functions, like the one above, running row by row wouldn't be meaningful (each value needs the next and the previous one)

  3. UDAs take a whole column but, as the name says, they aggregate it, like sum or avg would.

If the above is correct, which other techniques would allow me to create the type of function I need? An example of a SQL built-in function similar to what I'm after is square(), which (apparently) takes a column and returns it squared. My goal is to create a library of functions which behave like square, power, etc. Internally, though, they'll be different, because square takes and returns each scalar as it is read through the rows. I would like to know if it is possible to have a user-defined object with an accumulate method (like the UDA) able to operate on all the data at the end of the import and then return a column of the same length.

NB: At the moment I'm on SQL Server 2005, but we'll switch soon to 2012 (or possibly 2014 in a few months), so answers based on any 2005+ version of SQL Server are fine.

EDIT: added the R tag for R developers who have, hopefully, already faced such difficulties.

EDIT2: Added the CLR tag. I went through CLR user-defined aggregates as described in the Pro T-SQL 2005 Programmer's Guide. I already said above that this type of function wouldn't fit my needs, but it was worth looking into. The 4 methods needed by a UDA are: Init, Accumulate, Merge and Terminate. My request would need the whole data to be analysed together by the same instance of the UDA, so options involving Merge methods to combine partial results from multicore processing won't work.

Michele
  • For clarification: if you have 100 rows in your table and a column amount, is the result returned dependent on a single row's amount or on all rows' amounts? Your example function square takes in a single value and returns a single value. A function like sum processes multiple rows. Do you want a function like sum that processes multiple rows to produce a result, or like square that only needs one row to produce an output? – Vulcronos Sep 30 '13 at 18:30
  • @Vulcronos I used the derivative example exactly for this. Yes, the result depends on all the values in the variables supplied to the function. `is it possible to have a user-defined object with an accumulate method (like the UDA) able to operate ... ?` (I only used square as an example of visible structure/inline use, as I mentioned). – Michele Sep 30 '13 at 22:51
  • For your "EDIT2": You can just implement Merge and Accumulate the same way: Store the data in some big, parallel data structure (e.g. http://danieltao.com/ConcurrentList/). When you have all the data, you do the real aggregation in Terminate. – stefan.schwetschke Oct 01 '13 at 15:03
  • @stefan.schwetschke makes sense. The issue is: if the C# class doesn't return a scalar (like in the UDA), will the query engine accept this output? – Michele Oct 01 '13 at 15:34
  • According to the documentation, a UDA should work with a UDT: " ::= system_scalar_type | { [ udt_schema_name. ] udt_type_name }" (http://technet.microsoft.com/en-us/library/ms182741.aspx) – stefan.schwetschke Oct 01 '13 at 16:19
  • @stefan.schwetschke interesting. could the `system_scalar_type` output be changed to a different one returning a vector, then read by sql as column? – Michele Oct 02 '13 at 10:40
  • I think you can do this by using a trick. You return a system scalar type (e.g. an integer), but this is only a handle. In the background you store the whole result in the .NET heap. You then use a UDF, giving it the handle and optionally a part of the original input. This function then maps this to the real return value. In other words: the UDF maps the column return value to each row of the result set. But this is quite a hack; this should be kind of a last-resort solution. Especially resource management will be hard (freeing the intermediate column). – stefan.schwetschke Oct 07 '13 at 07:32

6 Answers

4

I think you may consider changing your mind a bit. The SQL language is very good at working with sets of data, especially in modern RDBMS implementations (like SQL Server 2012), but you have to think in sets, not in rows or columns. While I still don't know your exact tasks, let's see - SQL Server 2012 has a very nice set of window functions + ranking functions + analytic functions + common table expressions, so you can write almost any query inline. You can use chains of common table expressions to turn your data any way you want, to calculate running totals, to calculate averages or other aggregates over a window, and so on.

Actually, I've always liked SQL, and when I learned a bit of functional languages (ML and Scala), my thought was that my approach to SQL is very similar to the functional paradigm - just slicing and dicing data without saving anything into variables, until you have the result set you need.

Just a quick example: here's a question from SO - How to get average of the 'middle' values in a group?. The goal was to get, for each group, the average of the middle 3 values:

TEST_ID TEST_VALUE  GROUP_ID
1       5           1       -+
2       10          1        +- these values for group_id = 1
3       15          1       -+
4       25          2       -+
5       35          2        +- these values for group_id = 2
6       5           2       -+
7       15          2       
8       25          3
9       45          3       -+
10      55          3        +- these values for group_id = 3
11      15          3       -+
12      5           3
13      25          3
14      45          4       +- this value for group_id = 4

For me, it's not an easy task to do in R, but in SQL it could be a really simple query like this:

with cte as (
    select
        *,
        row_number() over(partition by group_id order by test_value) as rn,
        count(*) over(partition by group_id) as cnt
    from test
)
select
    group_id, avg(test_value)
from cte
where
    cnt <= 3 or
    (rn >= cnt / 2 - 1 and rn <= cnt / 2 + 1)
group by group_id

You can also easily expand this query to get 5 values around the middle.

Take a closer look at the analytic functions and try to rethink your calculations in terms of window functions; maybe it's not so hard to rewrite your R procedures in plain SQL.
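To make that concrete for the question's own example, here is a sketch of how derivative(Amount, Date) could be written inline with window functions on SQL Server 2012 (LAG is 2012+; partitioning by Customer_ID and using day as the time unit are assumptions on my part):

```sql
-- Backward finite difference per customer: (y_i - y_prev) / (t_i - t_prev).
-- The first row of each customer has no predecessor, so it yields NULL.
SELECT
    Customer_ID,
    [Date],
    Amount,
    (Amount - LAG(Amount) OVER (PARTITION BY Customer_ID ORDER BY [Date])) * 1.0
    / NULLIF(DATEDIFF(day,
                      LAG([Date]) OVER (PARTITION BY Customer_ID ORDER BY [Date]),
                      [Date]), 0) AS Derivative
FROM Customer_Detail;
```

No CLR is needed, and the optimizer can evaluate both LAG calls over a single sort of the partition.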

Hope it helps.

Roman Pekar
  • Hi, anything built-in probably won't fit my needs. I need to create my own library of functions via CLR, and what I'm asking is: given a C# class taking multiple values, like a UDA, but not returning just a scalar, can such a function be accepted by SQL? I thought my question was clear; the derivative function is a (very simple) example of the type of function I need to use like: `select value, deriv(value, time) from customer`. Of course, as you suggest, I need to think the "SQL way", and so here I'm asking you guys if the above is possible or not. The latter can also be a valid answer. – Michele Sep 30 '13 at 23:02
  • btw you can just replicate the exact methodology of your solution in `R`: `setkey(test, GROUP_ID, TEST_VALUE); test[, rn := 1:.N, by=GROUP_ID]; test[, cnt := .N, by=GROUP_ID]; test[ cnt <= 3 | (rn >= cnt / 2 - 1 & rn <= cnt / 2 + 1), mean(TEST_VALUE), by=GROUP_ID]` Anyway it's out of scope. – Michele Sep 30 '13 at 23:17
  • It certainly looks as though **SQL Window functions** (available from SQL Server 2005 onwards) should meet the need mentioned by @Michele: "I would like to know if it is possible to have a user-defined object with an accumulate method (like the UDA) able to operate on all the data at the end of the import and then return a column of the same length?". – Krishna Gupta Oct 05 '13 at 00:06
  • Hi again... just found out that in `R` the built-in function `mean` does that already by itself: `mean(test_value, trim=1/10)`. The parameter `trim` will exclude the 10% from the top and bottom. :) – Michele Oct 08 '13 at 22:52
3

I would solve this by passing a reference to the record(s) you want to process, and using a so-called "inline table-valued function" to return the record(s) after processing the initial records.

You can find the table-valued function reference here: http://technet.microsoft.com/en-en/library/ms186755.aspx

A Sample:

    CREATE FUNCTION Sales.CustomerExtendedInfo (@CustomerID int)
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT FirstName + LastName AS CompleteName,
               DATEDIFF(Day, CreateDate, GetDate()) AS DaysSinceCreation
        FROM Customer_Detail
        WHERE CustomerID = @CustomerID
    );
    GO

CustomerID would be the primary key of the records you want to process.

The table function can afterwards be joined to other query results if you want to process more than one record at once.

Here is a sample:

SELECT  * FROM Customer_Detail
CROSS APPLY Sales.CustomerExtendedInfo (CustomerID) 

Using a normal stored procedure would do more or less the same, but it's a bit tricky to work with the results programmatically.

But keep one thing in mind: SQL Server is not really good at "functional programming". It's brilliant at working with data and sets of data, but the more you use it as an "application server", the more you will realize it's not made for that.

  • will the above take a whole column as input? `@storeid` seems a scalar to me. – Michele Sep 26 '13 at 16:47
  • No, not a complete column, but StoreID would be the primary key of the row you want to process. You would then query inside the function for the row... You **could** work around it using XML that holds the record. But why not pass the reference and query for the record in the function? Using XML is described e.g. here: http://stackoverflow.com/questions/1609115/pass-table-as-parameter-into-sql-server-udf – Gregor Walter Sep 26 '13 at 17:21
  • understood your comment, thanks, but your answer still seems off topic, because I said I need to invoke this function in a select statement against, say, the `customer_detail` table. If this table has 1,000 rows, using your method the server will have to do 1,000,000 iterations (i.e. invoking the function for each row). Please tell me if I'm wrong. Also, for the example I mentioned above, the derivative, the function must have **all** the values at once, because you can't take the derivative of one single point... – Michele Sep 26 '13 at 21:55
  • Anyway thanks again for answering, and btw I agree that `SQL-Server is not really good for "functional-programming"`, but unfortunately it seems it's what I have to use... `R` rules :-) I kid... – Michele Sep 26 '13 at 21:56
  • It is called 1,000 times, not 1 million. I updated the sample above to include OUTER APPLY / CROSS APPLY to join the function to the table. The performance of this is quite good (getting much better with SQL Server 2008 or later instead of 2005, though), but sure, using functions always adds quite a bit of overhead. – Gregor Walter Sep 27 '13 at 10:32
  • I meant that each function call (1,000 of them) will have to go through the 1,000 rows... so 1 million. Anyway, I repeat, the problem is that the answer is not about the same topic as my question. I'm asking: given a column, I need a function that **modifies** that column and returns it, like `square`. Sorry, I probably wasn't clear. – Michele Sep 27 '13 at 10:47
3

I don't think this is possible in pure T-SQL without using cursors. But with cursors, things will usually be very slow. Cursors process the table row by row, and some people call this "slow-by-slow".

But you can create your own aggregate function (see Technet for more details). You have to implement the function using the .NET CLR (e.g. C# or R.NET).

For a nice example see here.

I think interfacing R with SQL is a very nice solution. Oracle is offering this combo as a commercial product, so why not go the same way with SQL Server.

When integrating R in the code using your own aggregate functions, you will only pay a small performance penalty. Custom aggregate functions are quite fast according to the Microsoft documentation: "Managed code generally performs slightly slower than built-in SQL Server aggregate functions". And the R.NET solution also seems to be quite fast, since it loads the native R DLL directly into the running process. So it should be much faster than using R over ODBC.
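For completeness, the T-SQL side of deploying such a CLR aggregate looks roughly like this (a sketch; the assembly and aggregate names are hypothetical, and loading a native R DLL via R.NET would most likely require PERMISSION_SET = UNSAFE rather than SAFE):

```sql
-- Register the compiled .NET assembly inside SQL Server
CREATE ASSEMBLY RFunctions
FROM 'C:\clr\RFunctions.dll'
WITH PERMISSION_SET = SAFE;  -- UNSAFE if the assembly loads native code (R.dll)
GO

-- Expose the aggregate class (with Init/Accumulate/Merge/Terminate) to T-SQL
CREATE AGGREGATE dbo.MyAgg (@value FLOAT)
RETURNS FLOAT
EXTERNAL NAME RFunctions.[MyAgg];
GO

-- It is then usable inline, like SUM or AVG
SELECT Customer_ID, dbo.MyAgg(Amount)
FROM Customer_Detail
GROUP BY Customer_ID;
```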

stefan.schwetschke
  • Hi, thanks for answering. Integrating R with SQL is possible via ODBC (I'm currently doing so). However, due to the high amount of data (100M+ rows, 50+ cols) in each table, this option is not doable for near-real-time analytics (besides any licensing issue when deploying as SaaS). Thanks for the example, I'll read it all. However, it seems we're still on aggregate output. My idea was: 1) the `Accumulate` method stores each value read through the rows, 2) something is done, 3) a vector with length=number_of_rows_read is returned. The C# side is fine, but can this type of assembly be deployed in SQL? – Michele Oct 01 '13 at 13:06
  • regarding R and DB integration, it seems that the only (production) useful work has been done by Teradata and Revolution Analytics, where the R interpreter is claimed to run in-database, without the need to copy and move data from the DB to the R server (like in `Oracle` and `SAP Hana` via `RServe`) – Michele Oct 01 '13 at 13:10
  • @Michele: It looks like the solution I described should work similar to the Teradata solution you describe: The R DLL should be loaded directly in the database server process. Could not test it yet, but based on the documentation it should work this way. – stefan.schwetschke Oct 01 '13 at 14:53
  • Oh yes, `R.NET` is fine but, as I understand it, it links `c#` and `R`. Then you'll still need to send the result back to the query engine. I didn't mention `R.NET` because I can translate `R` into `c#` quite easily. The big deal is having a user-defined object in SQL that takes whole columns, sends them to C#, and takes whole columns back as the result. That's why an aggregate won't work – Michele Oct 01 '13 at 15:41
2

ORIGINAL RESPONSE:

If you already know which functions you will need, one approach I can think of is creating one inline function for each method/operation you want to apply per table. What do I mean by that? For example, you mentioned FROM Customer_Detail; when you select, you might need one method, "derivative(Amount, Date)". Let's say a second method you might need (I am just making this up for explanation) is "derivative1(Amount1, Date1)". We create two inline functions; each will do its own calculation inside the function on the intended columns and also return the remaining columns as they are. That way you get all the columns as you would from the table, and the custom calculation is performed as a set-based operation instead of a scalar operation. Later you can combine the independent calculations of columns in the same function if that makes sense. You can still use all these functions and JOIN them to get all the custom calculations in a single set if needed, as all the functions will have the common/unprocessed columns coming through as they are. See the example below.

    IF object_id('Product','u') IS NOT NULL
          DROP TABLE Product
    GO
    CREATE TABLE Product
    (
          pname       sysname NOT NULL
          ,pid        INT         NOT NULL
          ,totalqty   INT         NOT NULL DEFAULT 1
          ,uprice           NUMERIC(28,10)    NOT NULL DEFAULT 0
    )
    GO
    INSERT INTO Product( pname, pid, totalqty, uprice )
                      SELECT      'pen',1,100,1.2
    UNION ALL   SELECT      'book',2,300,10.00
    UNION ALL   SELECT      'lock',3,500,15.00
    GO

    IF object_id('ufn_Product_totalValue','IF') IS NOT NULL
          DROP FUNCTION ufn_Product_totalValue
    GO
    CREATE FUNCTION ufn_Product_totalValue
    (
          @newqty           int
          ,@newuprice numeric(28,10)
    )
    RETURNS TABLE AS
    RETURN
    (
          SELECT pname,pid,totalqty,uprice,totalqty*uprice AS totalValue
          FROM
          (
                SELECT 
                            pname
                            ,pid
                            ,totalqty+@newqty AS totalqty
                            ,uprice+@newuprice AS uprice
                FROM Product
          )qry
    )
    GO

    IF object_id('ufn_Product_totalValuePct','IF') IS NOT NULL
          DROP FUNCTION ufn_Product_totalValuePct
    GO
    CREATE FUNCTION ufn_Product_totalValuePct
    (
          @newqty           int
          ,@newuprice numeric(28,10)
    )
    RETURNS TABLE AS
    RETURN
    (
          SELECT pname,pid,totalqty,uprice,totalqty*uprice/100 AS totalValuePct
          FROM
          (
                SELECT 
                            pname
                            ,pid
                            ,totalqty+@newqty AS totalqty
                            ,uprice+@newuprice AS uprice
                FROM Product
          )qry
    )
    GO

    SELECT * FROM ufn_Product_totalValue(10,5)

    SELECT * FROM ufn_Product_totalValuepct(10,5)

    select tv.pname,tv.pid,tv.totalValue,pct.totalValuePct
    from ufn_Product_totalValue(10,5) tv
    join ufn_Product_totalValuePct(10,5) pct
        on tv.pid=pct.pid

Also check the output from the queries above.

EDIT2:

Three-point smoothing algorithm:


    IF OBJECT_ID('Test3PointSmoothingAlgo','u') IS NOT NULL
        DROP TABLE Test3PointSmoothingAlgo
    GO
    CREATE TABLE Test3PointSmoothingAlgo
    (
        qty INT NOT NULL
        ,id INT IDENTITY NOT NULL
    )
    GO
    INSERT Test3PointSmoothingAlgo( qty ) SELECT 10 UNION SELECT 20 UNION SELECT 30
    GO

    IF object_id('ufn_Test3PointSmoothingAlgo_qty','IF') IS NOT NULL
          DROP FUNCTION ufn_Test3PointSmoothingAlgo_qty
    GO
    CREATE FUNCTION ufn_Test3PointSmoothingAlgo_qty
    (
        @ID INT --this is a dummy parameter
    )
    RETURNS TABLE AS
    RETURN
    (
        WITH CTE_3PSA(SmoothingPoint,Coefficients)
        AS --finding the ID of adjacent points
        (
            SELECT id,id
            FROM Test3PointSmoothingAlgo
            UNION
            SELECT id,id-1
            FROM Test3PointSmoothingAlgo
            UNION
            SELECT id,id+1
            FROM Test3PointSmoothingAlgo 
        )
        --Apply the 3-point smoothing formula
        SELECT a.SmoothingPoint,SUM(ISNULL(b.qty,0))/3 AS Qty_Smoothed --3-point smoothing formula
        FROM CTE_3PSA a
        LEFT JOIN Test3PointSmoothingAlgo b
        ON a.Coefficients=b.id
        GROUP BY a.SmoothingPoint
    )
    GO

    SELECT SmoothingPoint,Qty_Smoothed FROM dbo.ufn_Test3PointSmoothingAlgo_qty(NULL)


Anup Shah
  • Hi, thanks for the details, but it's very different from the operations I need. My function performs operations on **WHOLE** columns and needs to return a **WHOLE** column. The result depends on all the points, so you need to provide the function with all the rows at once. If you provide me examples of, say, a first derivative and a [three point smoothing](http://terpconnect.umd.edu/~toh/spectrum/Smoothing.html) using inline syntax, I'll accept the answer – Michele Oct 03 '13 at 12:00
  • Ok, previously I didn't get the WHOLE-COLUMN input and WHOLE-COLUMN output part. But when I looked at the example of the "three point smoothing algorithm" and picked the simplest version of it, I kind of got it. I have updated the answer by appending new code for the "three point smoothing algorithm". If you feel it is helpful, then also let me know what "derivative(Amount, Date)" does. If I see an example I can think of something. – Anup Shah Oct 03 '13 at 17:05
  • Hi thanks, yes the above does the job, but it isn't the kind of function I wanted. It will only work with the `Test3PointSmoothingAlgo` table (at least as it is defined now). I create functions to apply the same operations (like smoothing) to any (numeric) column in any table. Can the table and column names be supplied as function parameters? – Michele Oct 03 '13 at 21:56
  • no, I don't think we can achieve that nicely. From the first answer I posted I gave the impression that you should be OK with creating individual functions for each custom calc. A generic inline function is not possible, as SQL does not support dynamic SQL calls or columns as parameter replacements inside it. My choice of inline functions was due to performance reasons, even though they are strict in what we can do inside them. Can we think of other types of objects that can handle the same thing more generically? Maybe yes, but I doubt it will be that clean and performant. – Anup Shah Oct 04 '13 at 18:04
1

I think you may need to break your functionality into two parts: UDAs, which can work on scopes thanks to the OVER (...) clause, and formulas which combine the resulting scalars.

What you are asking for - defining objects in such a way as to make an aggregate/scalar combo - is probably out of scope of regular SQL Server's capabilities, unless you fall back on CLR code that would effectively be equivalent to a cursor in terms of performance, or worse.

Your best shot is probably to define an SP (I know you don't want that) that will produce the whole result: e.g. create a [derivative] stored procedure that takes the table and column names as parameters. You can even expand on the idea, but in the end that's not exactly what you want.
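A minimal sketch of such a procedure (the name sp_derivative is hypothetical, and the LAG-based formula assumes SQL Server 2012+; QUOTENAME guards the identifier parameters against injection):

```sql
CREATE PROCEDURE dbo.sp_derivative
    @table    sysname,
    @valuecol sysname,
    @timecol  sysname
AS
BEGIN
    -- Build the query text with the caller-supplied identifiers safely quoted
    DECLARE @sql nvarchar(max) =
        N'SELECT ' + QUOTENAME(@timecol) + N', ' + QUOTENAME(@valuecol) + N',
            (' + QUOTENAME(@valuecol) + N' - LAG(' + QUOTENAME(@valuecol) + N')
                 OVER (ORDER BY ' + QUOTENAME(@timecol) + N')) * 1.0
            / NULLIF(DATEDIFF(day,
                LAG(' + QUOTENAME(@timecol) + N') OVER (ORDER BY ' + QUOTENAME(@timecol) + N'),
                ' + QUOTENAME(@timecol) + N'), 0) AS derivative
          FROM ' + QUOTENAME(@table) + N';';
    EXEC sp_executesql @sql;
END
```

The result set can then be captured with INSERT ... EXEC if it needs to feed further queries.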

nimdil
  • an SP that can work with any table and column names (taking them as parameters) would be a better solution than an inline table-valued function that works only with a predefined table and column – Michele Oct 04 '13 at 13:20
  • Yea, so you can just make something like `sp_derivative` with `@table nvarchar(128), @param1col nvarchar(128), @param2col nvarchar(128)`, then define `@sql nvarchar(max)` and build a query around the formula. It's possible it may be worth running some aggregations first, but I think in many cases you can avoid it (worth checking the performance, though). Then just go for `exec sp_executesql @sql` and you have the result. Keep in mind that the result of the procedure can be inserted into a table so that later it can be part of a query: `insert into ... exec sp_derivative ...` – nimdil Oct 07 '13 at 08:40
1

Since you mention you will be upgrading to SQL Server 2012: SQL Server 2008 introduced Table-Valued Parameters.

This feature will do what you want. You will have to define a user-defined table type in your DB, which is like a table definition with columns and their respective types.

You can then use that table type as a parameter type for any other stored procedure or function in your DB.

You can combine these table-valued parameters with CLR integration to achieve what you require.

As mentioned, SQL is not good when you are comparing rows to other rows; it's much better at set-based operations where every row is treated as an independent entity. But before looking at cursors and CLR, you should make sure it can't be done in pure T-SQL, which will almost always be faster and scale better as your table grows.

One method for comparing rows based on order is to wrap your data in a CTE, adding a ranking function like ROW_NUMBER to set the row order, followed by a self-join of the CTE onto itself.

The join is performed on the ordered field, e.g. ROW_NUMBER = (ROW_NUMBER - 1).

Look at this article for an example
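Applied to the question's derivative example, that pattern might look like this on SQL Server 2005, before LAG/LEAD existed (a sketch; using day as the time unit is an assumption):

```sql
WITH ordered AS (
    SELECT Amount, [Date],
           ROW_NUMBER() OVER (ORDER BY [Date]) AS rn
    FROM Customer_Detail
)
SELECT cur.[Date],
       cur.Amount,
       (cur.Amount - prev.Amount) * 1.0
       / NULLIF(DATEDIFF(day, prev.[Date], cur.[Date]), 0) AS derivative
FROM ordered AS cur
LEFT JOIN ordered AS prev
    ON cur.rn = prev.rn + 1;  -- each row joined to its predecessor
```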

Ashley Pillay