Summing Contingency Tables in R with first column as character

Question

My sales dataset includes 3 columns: Countries, Sales Type/Method, Total Quarterly Revenue. Here's a display of the first few rows for a better idea:

   Retailer.country Order.method.type    Qtr.Rev
         <fctr>            <fctr>      <dbl>
 1        Australia            E-mail  171407.28
 2        Australia       Sales visit 2013909.18
 3        Australia           Special  158795.34
 4        Australia         Telephone 2289201.87
 5        Australia               Web 1738303.59
 6          Austria       Sales visit   66926.18
 7          Austria         Telephone 1671887.40
 8          Austria               Web 7050164.50
 9          Belgium       Sales visit 1655507.05
10          Belgium               Web 6222440.26
etc.........

Here's the dput of this data:

    structure(list(Retailer.country = structure(c(1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 
7L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 
11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 14L, 
14L, 14L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 
17L, 18L, 18L, 19L, 19L, 19L, 19L, 19L, 20L, 20L, 20L, 20L, 20L, 
21L, 21L, 21L, 21L, 21L, 21L), .Label = c("Australia", "Austria", 
"Belgium", "Brazil", "Canada", "China", "Denmark", "Finland", 
"France", "Germany", "Italy", "Japan", "Korea", "Mexico", 
"Netherlands", 
"Singapore", "Spain", "Sweden", "Switzerland", "United Kingdom", 
"United States"), class = "factor"), Order.method.type = 
structure(c(1L, 
4L, 5L, 6L, 7L, 4L, 6L, 7L, 4L, 7L, 7L, 1L, 2L, 4L, 7L, 2L, 4L, 
6L, 7L, 4L, 7L, 4L, 7L, 2L, 4L, 6L, 7L, 1L, 3L, 4L, 7L, 1L, 2L, 
4L, 5L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 6L, 7L, 4L, 5L, 7L, 
2L, 3L, 6L, 7L, 2L, 5L, 6L, 7L, 2L, 3L, 6L, 7L, 1L, 7L, 2L, 4L, 
5L, 6L, 7L, 1L, 2L, 4L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = 
c("E-mail", 
"Fax", "Mail", "Sales visit", "Special", "Telephone", "Web"), class = 
"factor"), 
    Qtr.Rev = c(171407.28, 2013909.18, 158795.34, 2289201.87, 
    1738303.59, 66926.18, 1671887.4, 7050164.5, 1655507.05, 
    6222440.26, 
    7746789.52, 6864270.12, 195549.5, 450628.79, 12376528.53, 
    415128.31, 1453194.14, 2735416.3, 15777880.11, 413978.16, 
    3776833.13, 308638.6, 12328172.97, 709194.65, 1304167.86, 
    5897377.14, 11048160.97, 1546079.43, 1247170.05, 2373591.15, 
    12102240.99, 2461322.51, 165800.42, 1397604.56, 198705.05, 
    7413833.64, 2662351.94, 289704.5, 680467.87, 87186.72, 343708.86, 
    1802166.73, 16990817.52, 2821127.32, 431860.34, 10144353.75, 
    5063353.42, 1725508.54, 3571760.87, 593828.88, 1074860.66, 
    2981026.86, 5254137.56, 469627.61, 908725.05, 1625096.56, 
    9677070.09, 88788.41, 337710.73, 254360.21, 7835117.44, 
    1292812.39, 
    4818848.86, 217936.39, 792168.42, 790344.28, 109161.04, 
    4565896.64, 
    697619.35, 264500.2, 189218.02, 2022968.96, 13756025.4, 
    1357389.56, 
    2352483.29, 2842600.85, 685752.21, 13437403.28, 29573813.7
    )), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-79L), .Names = c("Retailer.country", "Order.method.type", "Qtr.Rev"
))

I'm creating a contingency table in R that shows the quarterly revenue generated by each sales method for each country. The final output should look similar to this:

Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web  TOTAL.cn
1         Australia   171407.3       0.00       0.0  2013909.18  158795.3  2289201.9   1738304   6371617
2           Austria        0.0       0.00       0.0    66926.18       0.0  1671887.4   7050164   8788978
3           Belgium        0.0       0.00       0.0  1655507.05       0.0        0.0   6222440   7877947
4            Brazil        0.0       0.00       0.0        0.00       0.0        0.0   7746790   7746790
5            Canada  6864270.1  195549.50       0.0   450628.79       0.0        0.0  12376529  19886977
6             China        0.0  415128.31       0.0  1453194.14       0.0  2735416.3  15777880  20381619
7           Denmark        0.0       0.00       0.0   413978.16       
...
20   United Kingdom   697619.3  264500.20       0.0   189218.02       0.0  2022969.0  13756025  16930332
21    United States        0.0 1357389.56 2352483.3  2842600.85  685752.2 13437403.3  29573814  50249443
22       TOTAL.type 15695863.0 4767448.43 5692692.6 23233800.42 4811539.3 35257926.7 203769190 293228461

The cast() function from the reshape library gets most of the job done, and only leaves the summary column and row for all values to be computed.

cast(sales.by.country, Retailer.country ~ Order.method.type, 
fill=0) -> sales.by.country

Summing the rows into a new column named "TOTAL.cn" is pretty simple:

sales.by.country$TOTAL.cn <- rowSums(sales.by.country[,c(2:8)])

But summing the columns become a major headache, because the first component of that last row must either be a factor or character. I converted the first column "Retailer.country" to the character type, because it really serves as nothing more than a visual label.

After fussing around with several functions, this is the best code I was able to create to achieve the intended row summation:

# Sum the numeric columns, which is everything *except* column 1
total.by.ordertype <- (colSums(sales.by.country[,-1]))

# Create the Total by Order row
total.by.ordertype.row <- list("TOTAL.type", total.by.ordertype[1], 
total.by.ordertype[2], total.by.ordertype[3], total.by.ordertype[4], 
total.by.ordertype[5], total.by.ordertype[6], total.by.ordertype[7], 
total.by.ordertype[8])

# Add the Total by Order row to the bottom of the table
sales.by.country[22, ] <- total.by.ordertype.row

It works and maintains the proper data types among all columns... BUT I figure there has to be a more efficient way, perhaps by using the apply family of functions, something from dplyr, etc. Maybe the only way is to write my own function?

For example, future datasets may have 50+ different sales methods. When creating the list for the "Total by Order" row (above), I had to call out every single cell within the vector, separated by commas, in order to successfully add this to my existing table. Other efforts converted the data types of all the other columns to characters, which messes everything up.

I don't mind copying/pasting "total.by.ordertype" 8 times so much. But what happens when I'm dealing with 50-100 order types? Is there a cleaner method to reproduce any of this?

Thank you!

how about `rbind(sales.by.country, data.frame(Retailer.country="total", t(total.by.ordertype)))` — user20650, Jul 11 '17 at 23:13
@user20650 Good suggestion, thanks! That method did require re-naming the columns for the total.by.ordertype data, but this does work and prevents the copy/paste repetition of building the list: `data.frame(Retailer.country="Total.type", t(total.by.ordertype)) -> tf` `names(tf)[2:9] <- names(sales.by.country)[2:9]` `rbind(sales.by.country, tf) -> final.sales.by.country` Maintains the correct data types as well. — RVD, Jul 12 '17 at 01:49
RVD, colsums retains the names, so if you can add your original three column data to your question using `dput(yourdata)` we can see why that is not happening. — user20650, Jul 12 '17 at 01:57
@RVD Your `dput ` dataset is problematic. When I ran the code, the console shows this error message: `Error in structure(list(Retailer.country = structure(c(1L, 1L, 1L, 1L, : object 'Retailer.country' not found` — www, Jul 12 '17 at 02:31
@ycw Ah.. I think I may know why that's happening. That "original" 3-column data isn't a typical data frame. It has actually been grouped via dplyr. My script runs the following on a dataset with over 8,600 observations: `sd %>% group_by(Retailer.country, Order.method.type) %>% summarize(Qtr.Rev=sum(Revenue)) -> sd` This returns the data I've posted at the top of the question with only 79 observations. I'm guessing I can ungroup it and run dput, but that's going to create a massive list of output. Do you know of an easier way? — RVD, Jul 12 '17 at 03:05
@RVD Not sure what you are talking about. You posted some codes from dput. I copied and pasted those in my R console, and then it is not working. The purpose of dput is to share data. If it is not working, no one can work on your dataset. It thus difficult to determine what happened when the proposed solution cannot work. — www, Jul 12 '17 at 03:38
@ycw Right. When I attempted to recreate the data frame with the dput output on my own machine, I was seeing the same error as you. The error we're both getting is similar to the one discussed here: https://stackoverflow.com/questions/29039553/reproducible-example-and-dput-error In other words, the original 3 column dataset I posted is a grouped object. I just used ungroup() on that object, ran dput on it again, and have some new output. I'll edit the dput I posted in the original question. It worked for me this time. Please let me know if it's working for you as well. — RVD, Jul 12 '17 at 04:22
@RVD Your updated dput works now. I will update my answer based on the dataset you provided. — www, Jul 12 '17 at 11:16

Uwe · Accepted Answer · 2017-07-12T18:17:00.167

The cast() function from the reshape library can do the whole job. With the parameter margin = TRUE, all row and column totals will be computed:

reshape::cast(sales.by.country, Retailer.country ~ Order.method.type, fun.aggregate = sum, 
     fill = 0, margins = TRUE)

   Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web     (all)
1         Australia   171407.3       0.00       0.0  2013909.18  158795.3  2289201.9   1738304   6371617
2           Austria        0.0       0.00       0.0    66926.18       0.0  1671887.4   7050164   8788978
3           Belgium        0.0       0.00       0.0  1655507.05       0.0        0.0   6222440   7877947
4            Brazil        0.0       0.00       0.0        0.00       0.0        0.0   7746790   7746790
5            Canada  6864270.1  195549.50       0.0   450628.79       0.0        0.0  12376529  19886977
6             China        0.0  415128.31       0.0  1453194.14       0.0  2735416.3  15777880  20381619
7           Denmark        0.0       0.00       0.0   413978.16       0.0        0.0   3776833   4190811
8           Finland        0.0       0.00       0.0   308638.60       0.0        0.0  12328173  12636812
9            France        0.0  709194.65       0.0  1304167.86       0.0  5897377.1  11048161  18958901
10          Germany  1546079.4       0.00 1247170.1  2373591.15       0.0        0.0  12102241  17269082
11            Italy  2461322.5  165800.42       0.0  1397604.56  198705.0        0.0   7413834  11637266
12            Japan  2662351.9  289704.50  680467.9    87186.72  343708.9  1802166.7  16990818  22856404
13            Korea        0.0       0.00       0.0  2821127.32       0.0   431860.3  10144354  13397341
14           Mexico        0.0       0.00       0.0  5063353.42 1725508.5        0.0   3571761  10360623
15      Netherlands        0.0  593828.88 1074860.7        0.00       0.0  2981026.9   5254138   9903854
16        Singapore        0.0  469627.61       0.0        0.00  908725.1  1625096.6   9677070  12680519
17            Spain        0.0   88788.41  337710.7        0.00       0.0   254360.2   7835117   8515977
18           Sweden  1292812.4       0.00       0.0        0.00       0.0        0.0   4818849   6111661
19      Switzerland        0.0  217936.39       0.0   792168.42  790344.3   109161.0   4565897   6475507
20   United Kingdom   697619.3  264500.20       0.0   189218.02       0.0  2022969.0  13756025  16930332
21    United States        0.0 1357389.56 2352483.3  2842600.85  685752.2 13437403.3  29573814  50249443
22            (all) 15695863.0 4767448.43 5692692.6 23233800.42 4811539.3 35257926.7 203769190 293228461

Of course, fun.aggregate has to be specified as well.

The same functionality is also available from the reshape2 package, the successor of reshape, but about 4 times faster for this small sample size.

reshape2::dcast(sales.by.country, Retailer.country ~ Order.method.type, fun.aggregate = sum, 
                fill = 0, margins = TRUE)

dcast() is also available from the data.table package which claims to be faster than reshape2::dcast(). Unfortunately, the margins parameter hasn't been implemented yet (current CRAN version 1.10.4). Therefore, the margins have to be computed separately and combined with the original data:

DT2 <- rbind(
  DT,
  DT[, .(Qtr.Rev = sum(Qtr.Rev)), by = Retailer.country],
  DT[, .(Qtr.Rev = sum(Qtr.Rev)), by = Order.method.type],
  DT[, .(Qtr.Rev = sum(Qtr.Rev))], 
  fill = TRUE
)
dcast(DT2, Retailer.country ~ Order.method.type, fill = 0)

    Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web        NA
 1:        Australia   171407.3       0.00       0.0  2013909.18  158795.3  2289201.9   1738304   6371617
 2:          Austria        0.0       0.00       0.0    66926.18       0.0  1671887.4   7050164   8788978
 3:          Belgium        0.0       0.00       0.0  1655507.05       0.0        0.0   6222440   7877947
 4:           Brazil        0.0       0.00       0.0        0.00       0.0        0.0   7746790   7746790
 5:           Canada  6864270.1  195549.50       0.0   450628.79       0.0        0.0  12376529  19886977
 6:            China        0.0  415128.31       0.0  1453194.14       0.0  2735416.3  15777880  20381619
 7:          Denmark        0.0       0.00       0.0   413978.16       0.0        0.0   3776833   4190811
 8:          Finland        0.0       0.00       0.0   308638.60       0.0        0.0  12328173  12636812
 9:           France        0.0  709194.65       0.0  1304167.86       0.0  5897377.1  11048161  18958901
10:          Germany  1546079.4       0.00 1247170.1  2373591.15       0.0        0.0  12102241  17269082
11:            Italy  2461322.5  165800.42       0.0  1397604.56  198705.0        0.0   7413834  11637266
12:            Japan  2662351.9  289704.50  680467.9    87186.72  343708.9  1802166.7  16990818  22856404
13:            Korea        0.0       0.00       0.0  2821127.32       0.0   431860.3  10144354  13397341
14:           Mexico        0.0       0.00       0.0  5063353.42 1725508.5        0.0   3571761  10360623
15:      Netherlands        0.0  593828.88 1074860.7        0.00       0.0  2981026.9   5254138   9903854
16:        Singapore        0.0  469627.61       0.0        0.00  908725.1  1625096.6   9677070  12680519
17:            Spain        0.0   88788.41  337710.7        0.00       0.0   254360.2   7835117   8515977
18:           Sweden  1292812.4       0.00       0.0        0.00       0.0        0.0   4818849   6111661
19:      Switzerland        0.0  217936.39       0.0   792168.42  790344.3   109161.0   4565897   6475507
20:   United Kingdom   697619.3  264500.20       0.0   189218.02       0.0  2022969.0  13756025  16930332
21:    United States        0.0 1357389.56 2352483.3  2842600.85  685752.2 13437403.3  29573814  50249443
22:               NA 15695863.0 4767448.43 5692692.6 23233800.42 4811539.3 35257926.7 203769190 293228461
    Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web        NA

Excellent, that method makes this process much faster. Thanks! — RVD, Jul 12 '17 at 14:09

www · Answer 2 · 2017-07-12T11:24:49.450

0

A solution using functions from dplyr and tidyr. dt4 is the final output. Notice the use of summarise_if. It is useful when we only want to apply a function for columns that fit a pre-determined condition. In this case, we can apply the sum function only to columns that are numeric.

# Create example data frame
library(dplyr)
library(tidyr)

# sales.by.country is created by OP's dput dataset
dt2 <- sales.by.country %>% 
  mutate(Retailer.country = as.character(Retailer.country)) %>%
  # Spread the data frame
  spread(Order.method.type, Qtr.Rev, fill = 0) %>%
  # Calcualte Total.cn by rowSums
  mutate(TOTAL.cn = rowSums(.[, 2:ncol(.)])) 

# Calculate the sum of each column if it is numeric
dt3 <- dt2 %>% summarise_if(is.numeric, sum)

# Combine dt3 (the summary) to dt2
dt4 <- dt2 %>%
  bind_rows(dt3) %>%
  # Replace the na in Retailer.country to be "TOTAL.type"
  replace_na(list(Retailer.country = "TOTAL.type"))

edited Jul 12 '17 at 11:24

answered Jul 12 '17 at 01:14

www

38,575
12
48
84

I'm getting an error when attempting to declare dt2: `Error: incompatible size (21), expecting 1 (the group size) or 1` – RVD Jul 12 '17 at 02:41
@RVD Are you talking about the dt I created based on your example or your actual dataset? – www Jul 12 '17 at 03:32
I took my original data and saved it in a similar format before running your code. To keep this process reproducible, I just took your code above and pasted it into a new script. dt and dt2 are created without any issue. When I try to initialize dt3, I'm getting this error: `Error in parse(text = x) : :1:7: unexpected symbol 1: Sales visit ^` – RVD Jul 12 '17 at 03:58
@RVD You saw this error because in your original `sales.by.country` the `Retailer.country` column is factor, while when I created my example dataset, `dt`, the `Retailer.country` column is character. I have updated my answer, adding `mutate(Retailer.country = as.character(Retailer.country))` in the process to create `dt2`. Please let me know if this works. – www Jul 12 '17 at 11:22
Yep, got it to work! I had to change the 5th column name of dt2 to "sales.visit" (instead of "sales visit") before running the dt3 line. Otherwise, it throws the same parse error I posted earlier. After that, dt4 works like a charm. Thanks! – RVD Jul 12 '17 at 13:52

score 0 · Answer 3 · answered Nov 12 '19 at 14:36

0

Use tidyr to spread and janitor to add totals column & row:

library(janitor)
library(tidyr)
sales.by.country %>%
  spread(Order.method.type, Qtr.Rev, fill = 0) %>%
  adorn_totals(c("row", "col"))

answered Nov 12 '19 at 14:36

Sam Firke

21,571
9
87
105

Summing Contingency Tables in R with first column as character

3 Answers3