SQL/SAS equivalent code in Microsoft-r

Question

I wanted to convert one of my Proc SQL/SAS code in Rev R/Microsoft-r

here is my sample code

proc sql;
create table GENDER_YEAR as
select YEAR,GENDER,count(distinct CARD_NO) as CM_COUNT,sum(SPEND) as TOTAL_SPEND, sum(case when SPEND GT 0 then 1 else 0 end) as NO_OF_TRANS
from ABC group by YEAR,GENDER;
quit;

I'm trying below code in Rev R

library("RevoPemaR")
byGroupPemaObj <- PemaByGroup()

GENDER_cv_grouped <- pemaCompute(pemaObj = byGroupPemaObj, data = Merchant_Trans,groupByVar = "GENDER",computeVars = c("LOCAL_SPEND"),fnList = list(sum = list(FUN = sum, x = NULL)))

it Calculate only on thing at a time, but i need Distinct Count of CARD_NO, SUM of SPEND, and No of no zero Rows for Spend as Trans for each segment of YEAR & Gender.

Output Should look like below

YEAR  GENDER CM_COUNT TOTAL_SPEND NO_OF_TRANS
YEAR1 M      23        120        119
YEAR1 F      21        110        110
YEAR2 M      20        121        121
YEAR2 F      35        111        109

Looking forward help on this.

Welcome to Stack Overflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. — zx8754, Dec 14 '16 at 07:42
Have you tried the package https://github.com/RevolutionAnalytics/dplyrXdf ? It has the exact same syntax as dplyr but works on XDF files. Also, why not concatenate both those columns into 1 column with a comma separation and group by that ? — AbdealiLoKo, Jan 04 '17 at 03:57

score 0 · Answer 1 · answered Jan 04 '17 at 04:17

The easiest way to go about this it to concatenate the columns into a single column and use that. It seems that most both dplyrXdf and RevoPemaR do not support group by with 2 variables yet.

The way to do this would be by adding a rxDataStep on top which creates this variable first and then groups by it. Some approximate code for this is:

library("RevoPemaR")
byGroupPemaObj <- PemaByGroup()

rxDataStep(inData = Merchant_Trans, outFile = Merchant_Trans_Groups,
           transform = list(year_gender = paste(YEAR, GENDER,))

GENDER_cv_grouped <- pemaCompute(pemaObj = byGroupPemaObj,
    data = Merchant_Trans_Groups, groupByVar = "GENDER",
    computeVars = c("LOCAL_SPEND"),
    fnList = list(sum = list(FUN = sum, x = NULL)))

Note that overall there are 3 methods of doing a groupBy in RevR as far as I know. Each has it's pros and cons.

rxSplit - This actually creates different XDF files for each group that you want. This can be used with the splitByFactor arg where the factor specifies which groups should be created.
RevoPemaR's PemaByGroup - This assumes that each group's data can be stored in RAM. Which is a fair assumption. It also needs the original Xdf file to be sorted by the GroupBy column. And it only supports grouping by 1 column.
dplyrXdf's group_by - This is a spin on the popular dplyr package. It has many variable manipulation methods - so a different way to write rxSplit and rxDataStep) using dplyr like syntax. It also only supports 1 column to group with.

All three methods currently only support a single variable group operation. Hence they all require some pre processing of the data to work with.

Actually, dplyrXdf should work fine with multiple grouping variables (if it doesn't, please submit a bug report). The real issue here is that `count distinct` isn't a natively supported operation for xdf files and requires a [workaround.](http://stackoverflow.com/questions/36328996/count-distinct-in-a-rxsummary) — Hong Ooi, Jan 04 '17 at 06:09

Hong Ooi · Answer 2 · 2017-01-06T12:34:06.403

Here's a simple solution using dplyrXdf. Unlike with data frames, the n_distinct() summary function provided by dplyr doesn't work with xdf files, so this does a two-step summarisation: first including card_no as a grouping variable, and then count the number of card_no's.

First, generate some example data:

library(dplyrXdf)     # also loads dplyr
set.seed(12345)

df <- expand.grid(year=2000:2005, gender=c("F", "M")) %>%
    group_by(year, gender) %>%
    do(data.frame(card_no=sample(20, size=10, replace=TRUE),
                  spend=rbinom(10, 1, 0.5) * runif(10) * 100))

xdf <- rxDataStep(df, "ndistinct.xdf", overwrite=TRUE)

Now call summarise twice, taking advantage of the fact that the first summarise will remove card_no from the list of grouping variables:

smry <- xdf %>%
    mutate(trans=spend > 0) %>%
    group_by(year, gender, card_no) %>%
    summarise(n=n(), total_spend=sum(spend), no_of_trans=sum(trans)) %>%
    summarise(cm_count=n(), total_spend=sum(total_spend), no_of_trans=sum(no_of_trans))

as.data.frame(smry)    

   #year gender cm_count total_spend no_of_trans
#1  2000      F       10   359.30313           6
#2  2001      F        8   225.89571           3
#3  2002      F        7   332.58365           6
#4  2003      F        5   333.72169           5
#5  2004      F        7   280.90448           5
#6  2005      F        9   254.37680           5
#7  2000      M        8   309.77727           6
#8  2001      M        8   143.70835           2
#9  2002      M        8   269.64968           5
#10 2003      M        8   265.27049           4
#11 2004      M        9    99.73945           3
#12 2005      M        8   178.12686           6

Verify that this is the same result (modulo row ordering) as you'd get by running a dplyr chain on the original data frame:

df %>%
    group_by(year, gender) %>%
    summarise(cm_count=n_distinct(card_no), total_spend=sum(spend), no_of_trans=sum(spend > 0)) %>%
    arrange(gender, year)

    #year gender cm_count total_spend no_of_trans
   #<int> <fctr>    <int>       <dbl>       <int>
#1   2000      F       10   359.30313           6
#2   2001      F        8   225.89571           3
#3   2002      F        7   332.58365           6
#4   2003      F        5   333.72169           5
#5   2004      F        7   280.90448           5
#6   2005      F        9   254.37680           5
#7   2000      M        8   309.77727           6
#8   2001      M        8   143.70835           2
#9   2002      M        8   269.64968           5
#10  2003      M        8   265.27049           4
#11  2004      M        9    99.73945           3
#12  2005      M        8   178.12686           6

SQL/SAS equivalent code in Microsoft-r

2 Answers2