0

I'm running a script in SQL Server that calculates quantiles and a few other "custom" metrics in chunks via rxDataStep.

DECLARE @Rscript nvarchar(max) = N'

library("data.table")

connStr <- "Driver={ODBC Driver 13 for SQL Server};Server=TOMAZK\MSSQLSERVER2017;Database=AdventureWorks;Trusted_Connection=Yes"
query <- "SELECT [LOTNAME] --varchar
      ,[WAFERNUMBER] --varchar
      ,[SLOT]--varchar
      ,[RECIPE] --varchar
      ,[MODULENAME] --varchar
      ,[LOT_START] --datetime
      ,[STEP_IN] --datetime
      ,[STEPNAME] --varchar
      ,[MVName] AS MVNAME --varchar
      ,[MaxVal] --FLOAT
      ,[WAFERNAME]--varchar
      ,[SOURCE] FROM dbo.TABLE 


    outTabName <- "STAGE_TABLE"

    quant_func <- function(data){
  setDT(data)

  data[,.(Q1 = quantile(MaxVal, 0.25), Q3 = quantile(MaxVal, 0.75), MEDIANVAL = quantile(MaxVal, 0.50), IQR = quantile(MaxVal, 0.75)-quantile(MaxVal, 0.25),ROBUST_SIGMA = (quantile(MaxVal, 0.75)-quantile(MaxVal, 0.25))/1.35,MAX_STEP_IN = max(STEP_IN)), .(LOTNAME, STEPNAME,  WAFERNUMBER, WAFERNAME,LOT_START,RECIPE,MVNAME,SOURCE,MODULENAME)]};
    inTabDS <- RxSqlServerData(sqlQuery = query, connectionString =connStr)
    outTabDS <- RxSqlServerData(table = outTabName, connectionString = connStr)

rxDataStep(inData = inTabDS, outFile = outTabDS, maxRowsByCols = NULL, rowsPerRead = 500000, transformFunc = quant_func, transformPackages ="data.table", overwrite = TRUE)'





EXEC sp_execute_external_script @language = N'R'
      , @script = @Rscript
WITH result sets none;
GO 

For some reason I keep getting more than 1 "STEP_IN" value after the summary instead of getting the MAX value in each group as intended. I believe that it is so because the way rxDataStep works "chunking" data while processing. I'm under the impression that the solution lies somewhere in my transformFunc (quant_func), perhaps missing a for loop to accurately select the max(STEP_IN) despite the chunking but I'm not sure how to implement that... Any help is greatly appreciated.

PS1: I'm using rxDataStep because there's over 1 billion rows to read, so it wouldn't fit in memory otherwise.

PS2: I'm using data.table library because so far has the best performance when compared to base r and/or dplyr for example.

UPDATE: My code works just fine when the number of rows in the input data is <= the rowsPerRead(500000) value I set in rxDataStep. So it's clear that the problem is with the chunking then. Just need to know how to handle that behavior in my TransformFunc.

KemuFI
  • 23
  • 5
  • Your code is incomplete, perhaps just a copy/paste error. Namely, `query` has no closing quote, and your `connStr` isn't escaping your internal double-quotes. While this might be a problem with my understanding of how you intend to use this code, the code as you've presented it here is syntactically broken. – r2evans Jan 26 '20 at 21:56
  • Hi, That's just a copy/paste error. My code works when I subset the input data in the query, the problem is when I don't, i'm not able to get just 1 max(STEP_IN) value per group. As for how I intend to use this code, the idea is quite simple actually: - Bring input data (over 1 billion rows) from a SQL table - Calculate quantiles(Q1, Q3,etc..) and a few other values such as max(STEP_IN) - Write the output to another SQL table. – KemuFI Jan 27 '20 at 06:40
  • How (or why?) do you expect anybody to help with your code if you know it is syntactically wrong for no better reason than a copy/paste error? How does `connStr <- "Driver=SQL Server;Server="SERVER_NAME"...` not error right away? Providing reduced code for the problem is good and important (the **m**inimal in mwe), sure, but it is also helpful to have a *reproducible problem*. – r2evans Jan 27 '20 at 14:54
  • There you go, added sample server/database name. – KemuFI Jan 27 '20 at 18:41
  • I see. FYI, you did not need to replace it with the actual names, though that does clear up a few things ... you could have just fixed that *first* quote problem using something like `connStr <- "Driver={DRVR_TYPE};Server=SERVER_NAME;..."`, note that I both start and finish the *string* with the double quotes. You still have not corrected the rest of your code, **it is broken**, missing at least one more double-quote. And while I can see where I think it should be placed, the point is that the code you asked us to evaluate is broken and only because you present it that way. Good luck, KemuFI. – r2evans Jan 27 '20 at 19:06

0 Answers0