I'm running a script in SQL Server that calculates quantiles and a few other "custom" metrics in chunks via rxDataStep.
DECLARE @Rscript nvarchar(max) = N'
library("data.table")
connStr <- "Driver={ODBC Driver 13 for SQL Server};Server=TOMAZK\\MSSQLSERVER2017;Database=AdventureWorks;Trusted_Connection=Yes"
query <- "SELECT [LOTNAME] --varchar
,[WAFERNUMBER] --varchar
,[SLOT]--varchar
,[RECIPE] --varchar
,[MODULENAME] --varchar
,[LOT_START] --datetime
,[STEP_IN] --datetime
,[STEPNAME] --varchar
,[MVName] AS MVNAME --varchar
,[MaxVal] --FLOAT
,[WAFERNAME]--varchar
,[SOURCE] FROM dbo.TABLE"
outTabName <- "STAGE_TABLE"
quant_func <- function(data) {
setDT(data)
data[, .(Q1 = quantile(MaxVal, 0.25),
         Q3 = quantile(MaxVal, 0.75),
         MEDIANVAL = quantile(MaxVal, 0.50),
         IQR = quantile(MaxVal, 0.75) - quantile(MaxVal, 0.25),
         ROBUST_SIGMA = (quantile(MaxVal, 0.75) - quantile(MaxVal, 0.25)) / 1.35,
         MAX_STEP_IN = max(STEP_IN)),
     by = .(LOTNAME, STEPNAME, WAFERNUMBER, WAFERNAME, LOT_START, RECIPE, MVNAME, SOURCE, MODULENAME)]
}
inTabDS <- RxSqlServerData(sqlQuery = query, connectionString = connStr)
outTabDS <- RxSqlServerData(table = outTabName, connectionString = connStr)
rxDataStep(inData = inTabDS, outFile = outTabDS, maxRowsByCols = NULL, rowsPerRead = 500000, transformFunc = quant_func, transformPackages = "data.table", overwrite = TRUE)'
EXEC sp_execute_external_script @language = N'R'
, @script = @Rscript
WITH RESULT SETS NONE;
GO
For some reason I keep getting more than one "STEP_IN" value per group in the summary, instead of a single MAX value for each group as intended. I believe this happens because of the way rxDataStep "chunks" the data while processing. I'm under the impression that the solution lies somewhere in my transformFunc (quant_func), perhaps a loop to correctly select max(STEP_IN) despite the chunking, but I'm not sure how to implement that. Any help is greatly appreciated.
PS1: I'm using rxDataStep because there are over 1 billion rows to read, so the data wouldn't fit in memory otherwise.
PS2: I'm using the data.table library because so far it has had the best performance compared to base R and/or dplyr, for example.
UPDATE: My code works just fine when the number of rows in the input data is <= the rowsPerRead value (500000) I set in rxDataStep. So it's clear the problem is the chunking. I just need to know how to handle that behavior in my transformFunc.
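To make the symptom concrete: rxDataStep calls transformFunc once per chunk, so any group whose rows span two chunks gets summarized twice, once per chunk, and the output table ends up with multiple rows (and multiple per-chunk "maxes") for that group. Here is a minimal simulation of that behavior using only data.table (no SQL Server needed) with hypothetical data; the group column name LOTNAME is taken from my query:

```r
library(data.table)

# One group of 10 rows, split across two chunks of 5 rows each,
# the way rxDataStep feeds transformFunc chunk by chunk.
full <- data.table(LOTNAME = "LOT1", STEP_IN = 1:10)

summarise_chunk <- function(chunk) {
  chunk[, .(MAX_STEP_IN = max(STEP_IN)), by = LOTNAME]
}

# Whole data at once: one row per group, as intended.
one_pass <- summarise_chunk(full)

# Chunked: each chunk is summarised independently, so the same
# group appears once per chunk, each with its own per-chunk max.
chunked <- rbind(summarise_chunk(full[1:5]), summarise_chunk(full[6:10]))

print(one_pass)  # 1 row for LOT1
print(chunked)   # 2 rows for LOT1, one per chunk
```

This is exactly why the output is correct whenever the input fits inside a single rowsPerRead chunk: with one chunk, transformFunc sees every row of every group at once.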