3

Currently I'm using the reshape library to pivot data in R, but it seems to struggle when i many columns(4000+). Is there any Multithreaded alternative to this function (similar to RevoScaleR package by MS) or any better way to do this?

Here is an example of the Code that i have right now:

DROP TABLE IF EXISTS #DummyData
CREATE TABLE #DummyData
(
     [VariableA] VARCHAR(24)
    ,[VariableB] VARCHAR(24)
    ,[Value] SMALLINT
)

INSERT INTO #DummyData([VariableA], [VariableB], [Value])
VALUES   ('A1','B1', 4)
        ,('A1','B2', 3)
        ,('A1','B3', 1)
        ,('A2','B1', 2)
        ,('A2','B2', 1)
        ,('A2','B3', 3)
        ,('A3','B1', 4)
        ,('A3','B2', 5)
        ,('A3','B3', 2);

EXECUTE sp_execute_external_script    
      @language = N'R'   
    , @script = N'
        library(reshape)
        pivotData <- cast(DataIn, VariableA ~ VariableB  ,fun.aggregate = max)
        DataOut <- pivotData
    '   
    , @input_data_1 = N'SELECT [VariableA], [VariableB], [Value] FROM #DummyData'
    , @input_data_1_name  = N'DataIn'   
    , @output_data_1_name =  N'DataOut';

This query returs the following result:

enter image description here

gotqn
  • 42,737
  • 46
  • 157
  • 243
Mincho Minchev
  • 222
  • 3
  • 13
  • SQL Server uses Revo R. Besides, what are you trying to do? Do you really have a table with 4000 columns? Or are you trying to convert 4000 rows into just 3? – Panagiotis Kanavos Jul 27 '18 at 11:52
  • SQL Server is multithreaded and fast. You can use *SQL Server* to aggregate the data, grouping by VarA, VarB and transpose the result into a 4K-column wide dataframe. – Panagiotis Kanavos Jul 27 '18 at 11:54
  • @PanagiotisKanavos I doubt that creating `PIVOT` with 4000+ columns using T-SQL will be faster then some `R` function. – gotqn Jul 27 '18 at 12:08
  • The reason that I make the pivot in R is that I will make additional calculations with this data in R and i need it in this format. I've tried to pivot the data before passing it to the R script, but it was way slower that way. – Mincho Minchev Jul 27 '18 at 13:01

1 Answers1

0

Pivot in sql server maps internally to group by. There are many optimizations that have been added into sql server’s optimizer over the years including infra query parallelism. Please give it a try. The start up cost for external scripts is pretty large so you may find staying native to be a better plan in this case.

Conor Cunningham MSFT
  • 4,151
  • 1
  • 15
  • 21
  • The reason that I make the pivot in R is that I will make additional calculations with this data in R and i need it in this format. I've tried to pivot the data before passing it to the R script, but it was way slower that way. – Mincho Minchev Jul 30 '18 at 07:10
  • If you end up with lots of columns, the transport over to R may become a larger issue since it is a column-oriented compressed transport format. If you pivot on the SQL side, the data volume may be the same but the number of columns may be larger/compression may not be as good. So, I could see how this might become an issue. – Conor Cunningham MSFT Jul 31 '18 at 15:15