4

I'm looking for some R-code, which produces same output as a proc summary in SAS can do.

I've looked in this thread, which have somewhat similar issue: R: calculating column sums & row sums as an aggregation from a dataframe, but my problem is slightly different, since I don't want to sum across columns, but rather rows.

My data has the structure, which is shown here:

   Flag1   Flag2  Flag3   Type1 Type2 Type3
1     Level1    A    FIRST     2     0    0
2     Level1    A    SECOND    1     9    0
3     Level1    A    THIRD     3     7    0
4     Level1    A    FOURTH    9    18    0
5     Level1    A    FIFTH     1    22    0
6     Level1    A    SIXTH     1    13    0
7     Level1    B    FIRST     0     0    0
8     Level1    B    SECOND    3     9    0
9     Level1    B    THIRD     5    85    0
10    Level1    B    FOURTH    4    96    0
11    Level1    B    FIFTH     3    40    0
12    Level1    B    SIXTH     0    17    0
22    Level2    A    FIRST     2     0    0
23    Level2    A    SECOND    1     9    0
24    Level2    A    THIRD     3     7    0
25    Level2    A    FOURTH    9    18    0
26    Level2    A    FIFTH     1    22    0
27    Level2    A    SIXTH     1    13    0
28    Level2    B    FIRST     0     0    0
29    Level2    B    SECOND    3     9    0
30    Level2    B    THIRD     5    85    0
31    Level2    B    FOURTH    4    96    0
32    Level2    B    FIFTH     3    40    0
33    Level2    B    SIXTH     0    17    0
34    Level3    A    FIRST     2     0    0
35    Level3    A    SECOND    1     9    0
36    Level3    A    THIRD     3     7    0
37    Level3    A    FOURTH    9    18    0
38    Level3    A    FIFTH     1    22    0
39    Level3    A    SIXTH     1    13    0
40    Level3    B    FIRST     0     0    0
41    Level3    B    SECOND    3     9    0
42    Level3    B    THIRD     5    85    0
43    Level3    B    FOURTH    4    96    0
44    Level3    B    FIFTH     3    40    0
45    Level3    B    SIXTH     0    17    0

I would like to have the result look something like this, where the Sum could be sum of type1 for example.

    Flag1   Flag2   Flag3   Sum
1   Level1  A   FIRST   
2   Level1  A   SECOND  
3   Level1  A   THIRD   
4   Level1  A   FOURTH  
5   Level1  A   FIFTH   
6   Level1  A   SIXTH   
7   Level1  B   FIRST   
8   Level1  B   SECOND  
9   Level1  B   THIRD   
10  Level1  B   FOURTH  
11  Level1  B   FIFTH   
12  Level1  B   SIXTH  
13  Level1  (all)   FIRST   
14  Level1  (all)   SECOND  
15  Level1  (all)   THIRD   
16  Level1  (all)   FOURTH  
17  Level1  (all)   FIFTH   
18  Level1  (all)   SIXTH   
19  Level1  A   (all)   
20  Level1  B   (all)   
21  Level1  (all)   (all)  
22  Level2  A   FIRST   
23  Level2  A   SECOND  
24  Level2  A   THIRD   
25  Level2  A   FOURTH  
26  Level2  A   FIFTH   
27  Level2  A   SIXTH   
28  Level2  B   FIRST   
29  Level2  B   SECOND  
30  Level2  B   THIRD   
31  Level2  B   FOURTH  
32  Level2  B   FIFTH   
33  Level2  B   SIXTH   
34  Level2  (all)   FIRST 
35  Level2  (all)   SECOND
36  Level2  (all)   THIRD 
37  Level2  (all)   FOURTH
38  Level2  (all)   FIFTH 
39  Level2  (all)   SIXTH 
40  Level2  A   (all)   
41  Level2  B   (all)   
42  Level2  (all)   (all)   
43  Level3  A   FIRST   
44  Level3  A   SECOND  
45  Level3  A   THIRD   
46  Level3  A   FOURTH  
47  Level3  A   FIFTH   
48  Level3  A   SIXTH   
49  Level3  B   FIRST   
50  Level3  B   SECOND  
51  Level3  B   THIRD   
52  Level3  B   FOURTH  
53  Level3  B   FIFTH   
54  Level3  B   SIXTH   
55  Level3  (all)   FIRST   
56  Level3  (all)   SECOND  
57  Level3  (all)   THIRD   
58  Level3  (all)   FOURTH  
59  Level3  (all)   FIFTH   
60  Level3  (all)   SIXTH   
61  Level3  A   (all)   
62  Level3  B   (all)   
63  Level3  (all)   (all)   
64  (all)   A   FIRST   
65  (all)   A   SECOND  
66  (all)   A   THIRD   
67  (all)   A   FOURTH  
68  (all)   A   FIFTH   
69  (all)   A   SIXTH   
70  (all)   B   FIRST   
71  (all)   B   SECOND  
72  (all)   B   THIRD   
73  (all)   B   FOURTH  
74  (all)   B   FIFTH   
75  (all)   B   SIXTH   
76  (all)   A   (all)   
77  (all)   B   (all)   
78  (all)   (all)   (all)   

In SAS this is easy using something along the lines of:

PROC SUMMARY data=test;
class Flag1 Flag2 Flag3;
var Type1;
output=final_data Sum=sum(Type1);
run;

This will create all the different subtotals in all levels. (And an "nway" statement would only out the lowest level of the data). I can't seem to find any R-code that is as simple as this? Can you please help me. In my real data I have 7 dimensions, which is why a hardcoded way out of it is not an option.

Please ask elaboration question if the above is unclear.

Best Regards

  • Is the sum value in the example output missing? – StupidWolf Nov 04 '19 at 23:17
  • If I understand this correctly, this is producing a by-group sum plus all of its marginal sums. I suspect that this will either be iterative or reshaping (which would allow for the use of `addmargins`). – r2evans Nov 04 '19 at 23:28
  • I think you're asking this question: https://stackoverflow.com/questions/56709648/summarize-data-at-different-aggregate-levels-r-and-tidyverse – Reeza Nov 04 '19 at 23:43
  • Can you show what your expected output looks like ? – Ronak Shah Nov 05 '19 at 04:30
  • Hi all. @r2evans you're partially correct, since i will also need all different combinations of the 7 dimenions, and not only the 7 marginal sums (which is a subset of what i'm looking for. Reeza I'm not quite sure, since it appears to only creating the marginal sums. Allthough I can be quite wrong? :) – Martin Hoffmann Nov 05 '19 at 07:03
  • have you tried `library(mosaic)` using `favstats(x, data=dataset_name)` ? – Schilker Nov 05 '19 at 11:05

3 Answers3

1

Yep that is a fine questions....

I made your data like this:

data "c:\blah\test.sas7bdat";
input  id  Flag1 $   Flag2 $ Flag3 $  Type1 Type2 Type3;
datalines;
1     Level1    A    FIRST     2     0    0
2     Level1    A    SECOND    1     9    0
3     Level1    A    THIRD     3     7    0
4     Level1    A    FOURTH    9    18    0
5     Level1    A    FIFTH     1    22    0
6     Level1    A    SIXTH     1    13    0
7     Level1    B    FIRST     0     0    0
8     Level1    B    SECOND    3     9    0
9     Level1    B    THIRD     5    85    0
10    Level1    B    FOURTH    4    96    0
11    Level1    B    FIFTH     3    40    0
12    Level1    B    SIXTH     0    17    0
22    Level2    A    FIRST     2     0    0
23    Level2    A    SECOND    1     9    0
24    Level2    A    THIRD     3     7    0
25    Level2    A    FOURTH    9    18    0
26    Level2    A    FIFTH     1    22    0
27    Level2    A    SIXTH     1    13    0
28    Level2    B    FIRST     0     0    0
29    Level2    B    SECOND    3     9    0
30    Level2    B    THIRD     5    85    0
31    Level2    B    FOURTH    4    96    0
32    Level2    B    FIFTH     3    40    0
33    Level2    B    SIXTH     0    17    0
34    Level3    A    FIRST     2     0    0
35    Level3    A    SECOND    1     9    0
36    Level3    A    THIRD     3     7    0
37    Level3    A    FOURTH    9    18    0
38    Level3    A    FIFTH     1    22    0
39    Level3    A    SIXTH     1    13    0
40    Level3    B    FIRST     0     0    0
41    Level3    B    SECOND    3     9    0
42    Level3    B    THIRD     5    85    0
43    Level3    B    FOURTH    4    96    0
44    Level3    B    FIFTH     3    40    0
45    Level3    B    SIXTH     0    17    0
;
run;

In SAS I got the summary like this:

PROC SUMMARY data="c:\blah\test.sas7bdat";
class Flag1 Flag2 Flag3;
var Type1;
output out =final_data (drop = _type_ _freq_) Sum=sum ;
run;

To load the data into R, I used the haven package. Then I used functions from the dplyr package to do the processing

library(haven)
library(dplyr)

# the read_sas() function is in the haven package
test <- read_sas("c:\\blah\\test.sas7bdat")

# This uses dplry functions for nway
done <- test %>%
  group_by(Flag1, Flag2, Flag3) %>%
  summarise(sum(Type1)) %>%
  ungroup() %>%
  rename(Sum = `sum(Type1)`)


# This uses dplry functions to do all the subgroups
done1 <- test %>%
  group_by(Flag1) %>%
  summarise(sum(Type1)) %>%
  rename(Sum = `sum(Type1)`)
done2 <- test %>%
  group_by(Flag1, Flag2) %>%
  summarise(sum(Type1)) %>%
  rename(Sum = `sum(Type1)`)
done3 <- test %>%
  group_by(Flag1, Flag2, Flag3) %>%
  summarise(sum(Type1)) %>%
  rename(Sum = `sum(Type1)`)

all <- bind_rows(done3, done2, done1)

The good news is dplyr uses functions with names that are easy to understand verbs. If you would like to learn more check out R for Data Science

In theory, the code that makes the "done" objects can be put into a function and then run using the map function in the purrr package. I have not figured out how to make the list of all the variables. There has to be a function to this but I can't find it. This post gets you part way:

Yihui on combinations

stringsOfVariables <- capture.output(
  for (i in 1:n) {
    if (i == 1) {
      for (j in 1:n) {
        cat(x[j], "\n")
      }
    }
    else {
      for (j in 1:(n - i + 1)) {
        for (k in (j + i - 1):n) {
          cat(c(x[j:(j + i - 2)], x[k]), "\n",  sep = ",")
        }
      }
    }
  }
)
stringsOfVariables

Here is a function that makes the summary for any set of variables.

library(rlang)
counts <- function(...) {
  vars <- enquos(...)  
  test %>%
    group_by(!!!vars) %>%
    summarise(sum(Type1))
}
done3 <- counts(Flag1,Flag2,Flag3)

Maybe somebody else can add in the missing tidyverse bits...

itsMeInMiami
  • 2,324
  • 1
  • 13
  • 34
1

Borrowing sample data from @r2evans we can use reshape2::dcast() here.

It's a somewhat unusual usage as we use it only for its aggregating and margin properties and not to reshape/pivot to wider.

Note that data.table::dcast() supports several value.vars but margins are not supported so that won't help us here.

library(reshape2)
dcast(dat,  Flag1 + Flag2 + Flag3 ~ ., fun.aggregate = sum, value.var = "Type1", margins = TRUE)
#>     Flag1 Flag2  Flag3  .
#> 1  Level1     A  FIFTH  1
#> 2  Level1     A  FIRST  2
#> 3  Level1     A FOURTH  9
#> 4  Level1     A SECOND  1
#> 5  Level1     A  SIXTH  1
#> 6  Level1     A  THIRD  3
#> 7  Level1     A  (all) 17
#> 8  Level1     B  FIFTH  3
#> 9  Level1     B  FIRST  0
#> 10 Level1     B FOURTH  4
#> 11 Level1     B SECOND  3
#> 12 Level1     B  SIXTH  0
#> 13 Level1     B  THIRD  5
#> 14 Level1     B  (all) 15
#> 15 Level1 (all)  (all) 32
#> 16 Level2     A  FIFTH  1
#> 17 Level2     A  FIRST  2
#> 18 Level2     A FOURTH  9
#> 19 Level2     A SECOND  1
#> 20 Level2     A  SIXTH  1
#> 21 Level2     A  THIRD  3
#> 22 Level2     A  (all) 17
#> 23 Level2     B  FIFTH  3
#> 24 Level2     B  FIRST  0
#> 25 Level2     B FOURTH  4
#> 26 Level2     B SECOND  3
#> 27 Level2     B  SIXTH  0
#> 28 Level2     B  THIRD  5
#> 29 Level2     B  (all) 15
#> 30 Level2 (all)  (all) 32
#> 31 Level3     A  FIFTH  1
#> 32 Level3     A  FIRST  2
#> 33 Level3     A FOURTH  9
#> 34 Level3     A SECOND  1
#> 35 Level3     A  SIXTH  1
#> 36 Level3     A  THIRD  3
#> 37 Level3     A  (all) 17
#> 38 Level3     B  FIFTH  3
#> 39 Level3     B  FIRST  0
#> 40 Level3     B FOURTH  4
#> 41 Level3     B SECOND  3
#> 42 Level3     B  SIXTH  0
#> 43 Level3     B  THIRD  5
#> 44 Level3     B  (all) 15
#> 45 Level3 (all)  (all) 32
#> 46  (all) (all)  (all) 96

You can then rename your column, or to avoid renaming, just create a constant column first :

dat$whatev <- "Sum"
dcast(dat,  Flag1 + Flag2 + Flag3 ~ whatev, fun.aggregate = sum, value.var = "Type1", margins = TRUE)

get all sums in the same table

Since other answers provided it, if you want all sums in the same table you can do this :

library(reshape2)

sum_with_margins <- function(col) {
   dat$whatev <- col
   dcast(dat,  Flag1 + Flag2 + Flag3 ~ whatev, fun.aggregate = sum, 
         value.var = col, margins = paste0("Flag",1:3))
}

Reduce(merge, lapply(paste0("Type",1:3),sum_with_margins))
#>     Flag1 Flag2  Flag3 Type1 Type2 Type3
#> 1   (all) (all)  (all)    96   948     0
#> 2  Level1 (all)  (all)    32   316     0
#> 3  Level1     A  (all)    17    69     0
#> 4  Level1     A  FIFTH     1    22     0
#> 5  Level1     A  FIRST     2     0     0
#> 6  Level1     A FOURTH     9    18     0
#> 7  Level1     A SECOND     1     9     0
#> 8  Level1     A  SIXTH     1    13     0
#> 9  Level1     A  THIRD     3     7     0
#> 10 Level1     B  (all)    15   247     0
#> 11 Level1     B  FIFTH     3    40     0
#> 12 Level1     B  FIRST     0     0     0
#> 13 Level1     B FOURTH     4    96     0
#> 14 Level1     B SECOND     3     9     0
#> 15 Level1     B  SIXTH     0    17     0
#> 16 Level1     B  THIRD     5    85     0
#> 17 Level2 (all)  (all)    32   316     0
#> 18 Level2     A  (all)    17    69     0
#> 19 Level2     A  FIFTH     1    22     0
#> 20 Level2     A  FIRST     2     0     0
#> 21 Level2     A FOURTH     9    18     0
#> 22 Level2     A SECOND     1     9     0
#> 23 Level2     A  SIXTH     1    13     0
#> 24 Level2     A  THIRD     3     7     0
#> 25 Level2     B  (all)    15   247     0
#> 26 Level2     B  FIFTH     3    40     0
#> 27 Level2     B  FIRST     0     0     0
#> 28 Level2     B FOURTH     4    96     0
#> 29 Level2     B SECOND     3     9     0
#> 30 Level2     B  SIXTH     0    17     0
#> 31 Level2     B  THIRD     5    85     0
#> 32 Level3 (all)  (all)    32   316     0
#> 33 Level3     A  (all)    17    69     0
#> 34 Level3     A  FIFTH     1    22     0
#> 35 Level3     A  FIRST     2     0     0
#> 36 Level3     A FOURTH     9    18     0
#> 37 Level3     A SECOND     1     9     0
#> 38 Level3     A  SIXTH     1    13     0
#> 39 Level3     A  THIRD     3     7     0
#> 40 Level3     B  (all)    15   247     0
#> 41 Level3     B  FIFTH     3    40     0
#> 42 Level3     B  FIRST     0     0     0
#> 43 Level3     B FOURTH     4    96     0
#> 44 Level3     B SECOND     3     9     0
#> 45 Level3     B  SIXTH     0    17     0
#> 46 Level3     B  THIRD     5    85     0
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
0

Here's a shot that I think gives you all marginals.

Up Front

eg <- do.call(expand.grid, c(lapply(dat[1:3], function(a) c(NA, unique(a))),
                             stringsAsFactors = FALSE))
head(eg)
out <- do.call(
  Map,
  c(unname(eg), list(f = function(f1, f2, f3) {
    subx <- subset(dat, (is.na(f1) | f1 == Flag1) &
                          (is.na(f2) | f2 == Flag2) &
                          (is.na(f3) | f3 == Flag3))
    subx <- subx[,setdiff(colnames(subx), c("Flag1", "Flag2", "Flag3"))]
    c(sapply(subx, sum), "(all)" = sum(unlist(subx)))
  })))
out <- cbind.data.frame(
  sapply(eg, function(a) ifelse(is.na(a), "(all)", as.character(a))),
  do.call(rbind, out),
  stringsAsFactors = FALSE)
rownames(out) <- NULL # cosmetic
out <- out[order(out$Flag1 == "(all)", out$Flag1,
                 out$Flag2 == "(all)", out$Flag2,
                 out$Flag3 == "(all)", out$Flag3),]
out[c(1,2,7,8,13,14,20,21,64,84),] # cherry-pick for this view
#     Flag1 Flag2 Flag3 Type1 Type2 Type3 (all)
# 66 Level1     A FIFTH     1    22     0    23
# 18 Level1     A FIRST     2     0     0     2
# 6  Level1     A (all)    17    69     0    86
# 70 Level1     B FIFTH     3    40     0    43
# 46 Level1     B THIRD     5    85     0    90
# 10 Level1     B (all)    15   247     0   262
# 38 Level1 (all) THIRD     8    92     0   100
# 2  Level1 (all) (all)    32   316     0   348
# 65  (all)     A FIFTH     3    66     0    69
# 1   (all) (all) (all)    96   948     0  1044

Walk-through

  1. The first part generates a frame with all possible combinations of existing Flag* variables, plus the special NA (explained later):

    eg <- do.call(expand.grid, c(lapply(dat[1:3], function(a) c(NA, unique(a))),
                                 stringsAsFactors = FALSE))
    head(eg)
    #    Flag1 Flag2 Flag3
    # 1   <NA>  <NA>  <NA>
    # 2 Level1  <NA>  <NA>
    # 3 Level2  <NA>  <NA>
    # 4 Level3  <NA>  <NA>
    # 5   <NA>     A  <NA>
    # 6 Level1     A  <NA>
    nrow(eg)
    # [1] 84
    
  2. (This seems complex.) Map iterates over each row of eg, but it typically takes individual list/vector components as arguments. Because eg is already a list (data.frame), I need to use do.call to pass columns from eg as individual (not-named) arguments. (Technically, I could have done Map(function(...)..., eg[,1], eg[,2], eg[,3]), but that's less-general, in case you have non-constant or just more columns.)

    Internal to the function, it subsets the whole frame based on simple criteria: if the Flag* variable from eg is NA, then accept all values for that flag, otherwise match exactly. (NB, this is doing a little scope-breach since it is reaching out of the anonymous function to find dat.)

    out <- do.call(
      Map,
      c(unname(eg), list(f = function(f1, f2, f3) {
        subx <- subset(dat, (is.na(f1) | f1 == Flag1) &
                              (is.na(f2) | f2 == Flag2) &
                              (is.na(f3) | f3 == Flag3))
        subx <- subx[,setdiff(colnames(subx), c("Flag1", "Flag2", "Flag3"))]
        c(sapply(subx, sum), "(all)" = sum(unlist(subx)))
      })))
    head(out, n=3)
    # $<NA>
    # Type1 Type2 Type3 (all) 
    #    96   948     0  1044 
    # $Level1
    # Type1 Type2 Type3 (all) 
    #    32   316     0   348 
    # $Level2
    # Type1 Type2 Type3 (all) 
    #    32   316     0   348 
    
  3. Rename the NA to your aesthetic (all), ensuring character (not factor). (Remove the row-names created with cbind..., just cosmetic.)

    out <- cbind.data.frame(
      sapply(eg, function(a) ifelse(is.na(a), "(all)", as.character(a))),
      do.call(rbind, out),
      stringsAsFactors = FALSE)
    rownames(out) <- NULL
    
  4. Order them by flag, (all) last.

    out <- out[order(out$Flag1 == "(all)", out$Flag1,
                     out$Flag2 == "(all)", out$Flag2,
                     out$Flag3 == "(all)", out$Flag3),]
    

I realize that the ordering of Flag3 is not contextually ordinal. For that, I suggest using factor.

Disclaimer: I have not verified all of the sums are correct.


Data, no factors:

dat <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
   Flag1   Flag2  Flag3   Type1 Type2 Type3
1     Level1    A    FIRST     2     0    0
2     Level1    A    SECOND    1     9    0
3     Level1    A    THIRD     3     7    0
4     Level1    A    FOURTH    9    18    0
5     Level1    A    FIFTH     1    22    0
6     Level1    A    SIXTH     1    13    0
7     Level1    B    FIRST     0     0    0
8     Level1    B    SECOND    3     9    0
9     Level1    B    THIRD     5    85    0
10    Level1    B    FOURTH    4    96    0
11    Level1    B    FIFTH     3    40    0
12    Level1    B    SIXTH     0    17    0
22    Level2    A    FIRST     2     0    0
23    Level2    A    SECOND    1     9    0
24    Level2    A    THIRD     3     7    0
25    Level2    A    FOURTH    9    18    0
26    Level2    A    FIFTH     1    22    0
27    Level2    A    SIXTH     1    13    0
28    Level2    B    FIRST     0     0    0
29    Level2    B    SECOND    3     9    0
30    Level2    B    THIRD     5    85    0
31    Level2    B    FOURTH    4    96    0
32    Level2    B    FIFTH     3    40    0
33    Level2    B    SIXTH     0    17    0
34    Level3    A    FIRST     2     0    0
35    Level3    A    SECOND    1     9    0
36    Level3    A    THIRD     3     7    0
37    Level3    A    FOURTH    9    18    0
38    Level3    A    FIFTH     1    22    0
39    Level3    A    SIXTH     1    13    0
40    Level3    B    FIRST     0     0    0
41    Level3    B    SECOND    3     9    0
42    Level3    B    THIRD     5    85    0
43    Level3    B    FOURTH    4    96    0
44    Level3    B    FIFTH     3    40    0
45    Level3    B    SIXTH     0    17    0")
r2evans
  • 141,215
  • 6
  • 77
  • 149