I'm running a script in SQL Server that calculates quantiles and a few other "custom" metrics in chunks via rxDataStep.
DECLARE @Rscript nvarchar(max) = N'
library("data.table")
connStr <- "Driver={ODBC Driver 13 for SQL Server};Server=TOMAZK\\MSSQLSERVER2017;Database=AdventureWorks;Trusted_Connection=Yes"
query <- "SELECT [LOTNAME] --varchar
,[WAFERNUMBER] --varchar
,[SLOT]--varchar
,[RECIPE] --varchar
,[MODULENAME] --varchar
,[LOT_START] --datetime
,[STEP_IN] --datetime
,[STEPNAME] --varchar
,[MVName] AS MVNAME --varchar
,[MaxVal] --FLOAT
,[WAFERNAME]--varchar
,[SOURCE] FROM dbo.TABLE"
outTabName <- "STAGE_TABLE"
quant_func <- function(data) {
setDT(data)
data[, .(Q1 = quantile(MaxVal, 0.25),
         Q3 = quantile(MaxVal, 0.75),
         MEDIANVAL = quantile(MaxVal, 0.50),
         IQR = quantile(MaxVal, 0.75) - quantile(MaxVal, 0.25),
         ROBUST_SIGMA = (quantile(MaxVal, 0.75) - quantile(MaxVal, 0.25)) / 1.35,
         MAX_STEP_IN = max(STEP_IN)),
     by = .(LOTNAME, STEPNAME, WAFERNUMBER, WAFERNAME, LOT_START, RECIPE, MVNAME, SOURCE, MODULENAME)]
}
inTabDS <- RxSqlServerData(sqlQuery = query, connectionString = connStr)
outTabDS <- RxSqlServerData(table = outTabName, connectionString = connStr)
rxDataStep(inData = inTabDS, outFile = outTabDS, maxRowsByCols = NULL, rowsPerRead = 500000, transformFunc = quant_func, transformPackages = "data.table", overwrite = TRUE)'
EXEC sp_execute_external_script @language = N'R'
, @script = @Rscript
WITH RESULT SETS NONE;
GO
For some reason I keep getting more than one "STEP_IN" value per group in the summary, instead of a single MAX value for each group as intended. I believe this happens because of the way rxDataStep "chunks" the data while processing. I'm under the impression that the solution lies somewhere in my transformFunc (quant_func), perhaps a loop to correctly select max(STEP_IN) despite the chunking, but I'm not sure how to implement that. Any help is greatly appreciated.
PS1: I'm using rxDataStep because there are over 1 billion rows to read, so the data wouldn't fit in memory otherwise.
PS2: I'm using the data.table library because so far it has had the best performance compared to base R and/or dplyr, for example.
UPDATE: My code works just fine when the number of rows in the input data is <= the rowsPerRead value (500000) I set in rxDataStep. So it's clear the problem is the chunking. I just need to know how to handle that behavior in my transformFunc.
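To make the symptom concrete: rxDataStep calls transformFunc once per chunk, so any group whose rows span two chunks gets summarized twice, once per chunk, and the output table ends up with multiple rows (and multiple per-chunk "maxes") for that group. Here is a minimal simulation of that behavior using only data.table (no SQL Server needed) with hypothetical data; the group column name LOTNAME is taken from my query:

```r
library(data.table)

# One group of 10 rows, split across two chunks of 5 rows each,
# the way rxDataStep feeds transformFunc chunk by chunk.
full <- data.table(LOTNAME = "LOT1", STEP_IN = 1:10)

summarise_chunk <- function(chunk) {
  chunk[, .(MAX_STEP_IN = max(STEP_IN)), by = LOTNAME]
}

# Whole data at once: one row per group, as intended.
one_pass <- summarise_chunk(full)

# Chunked: each chunk is summarised independently, so the same
# group appears once per chunk, each with its own per-chunk max.
chunked <- rbind(summarise_chunk(full[1:5]), summarise_chunk(full[6:10]))

print(one_pass)  # 1 row for LOT1
print(chunked)   # 2 rows for LOT1, one per chunk
```

This is exactly why the output is correct whenever the input fits inside a single rowsPerRead chunk: with one chunk, transformFunc sees every row of every group at once.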