Calculating percent difference from R data frame

Question

I'm trying to figure out how to do some basic math with a data frame.

I have a data frame that looks like this:

| Version | Total | Case   |
|---------|-------|--------|
| 1.0.1   | 110   | Case 1 |
| 1.0.2   | 111   | Case 1 |
| 1.0.3   | 114   | Case 1 |
| 1.0.4   | 114   | Case 1 |
| 1.0.5   | 113   | Case 1 |
| 1.0.1   |  53   | Case 2 |
| 1.0.2   |  53   | Case 2 |
| 1.0.3   |  56   | Case 2 |
| 1.0.4   |  57   | Case 2 |
| 1.0.5   |  55   | Case 2 |
| 1.0.1   | 110   | Case 3 |
| 1.0.2   | 111   | Case 3 |
| 1.0.3   | 113   | Case 3 |
| 1.0.4   | 114   | Case 3 |
| 1.0.5   | 113   | Case 3 |
| 1.0.1   |  52   | Case 4 |
| 1.0.2   |  53   | Case 4 |
| 1.0.3   |  56   | Case 4 |
| 1.0.4   |  57   | Case 4 |
| 1.0.5   |  55   | Case 4 |

I want to calculate the "percent different" between 'Case 1 and 2' and then also 'Case 3 and 4' for each Version. So for 1.0.1 it would do this math: (110-53)/(.5*(110+53))

Ultimately it would end up with a table that looked like this:

| Version | Total | Case       |
|---------|-------|------------|
| 1.0.1   | 70%   | Case 1 & 2 |
| 1.0.2   | 71%   | Case 1 & 2 |
| 1.0.3   | 68%   | Case 1 & 2 |
| 1.0.4   | 67%   | Case 1 & 2 |
| 1.0.5   | 69%   | Case 1 & 2 |
| 1.0.1   | 72%   | Case 3 & 4 |
| 1.0.2   | 71%   | Case 3 & 4 |
| 1.0.3   | 67%   | Case 3 & 4 |
| 1.0.4   | 67%   | Case 3 & 4 |
| 1.0.5   | 69%   | Case 3 & 4 |

EDIT: Here's a working example of the first table to use.

Version <- c('1.0.1', '1.0.2', '1.0.3', '1.0.4', '1.0.5', '1.0.1', '1.0.2', '1.0.3', '1.0.4', '1.0.5', '1.0.1', '1.0.2', '1.0.3', '1.0.4', '1.0.5', '1.0.1', '1.0.2', '1.0.3', '1.0.4', '1.0.5')
Total <- c(110, 111, 114, 114, 113, 53, 53, 56, 57, 55, 110, 111, 113, 114, 113, 52, 53, 56, 57, 55)
Case <- c('Case 1', 'Case 1', 'Case 1', 'Case 1', 'Case 1', 'Case 2', 'Case 2', 'Case 2', 'Case 2', 'Case 2', 'Case 3', 'Case 3', 'Case 3', 'Case 3', 'Case 3', 'Case 4', 'Case 4', 'Case 4', 'Case 4', 'Case 4')
df <- data.frame(Version, Total, Case)

This might seriously help you http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega. The `apply` family of functions are essential to using R properly. — Eli Sadoff, Oct 20 '16 at 18:19
I think melt and cast would be helpful. I'm happy to craft an answer but please add some sample data in a reproducible way so that posters have access to it (without manually entering the text data). http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example will help you! — Joy, Oct 20 '16 at 18:30
I deleted the phrase "I'm a bit of an R newbie." from the Q. As a general rule, this type of language is discouraged in SO because it distracts from the question. Moreover, as you can see from the answers the solution is not trivial in any case. — dww, Oct 20 '16 at 19:39

dww · Answer 1 · 2016-10-20T19:42:06.207

You can use library (data.table)

setDT(df)
ans = df[, .(`case 1 & 2` = 200*(.SD[Case=="Case 1", Total] - .SD[Case=="Case 2", Total]) / (.SD[Case=="Case 1", Total] + .SD[Case=="Case 2", Total]),
             `case 3 & 4` = 200*(.SD[Case=="Case 3", Total] - .SD[Case=="Case 4", Total]) / (.SD[Case=="Case 1", Total] + .SD[Case=="Case 2", Total])
       ), by=Version]
#    Version case 1 & 2 case 3 & 4
# 1:   1.0.1   69.93865   71.16564
# 2:   1.0.2   70.73171   70.73171
# 3:   1.0.3   68.23529   67.05882
# 4:   1.0.4   66.66667   66.66667
# 5:   1.0.5   69.04762   69.04762

If you need this in long format you can use melt

melt(ans, id="Version")
#    Version   variable    value
# 1:   1.0.1 case 1 & 2 69.93865
# 2:   1.0.2 case 1 & 2 70.73171
# 3:   1.0.3 case 1 & 2 68.23529
# 4:   1.0.4 case 1 & 2 66.66667
# 5:   1.0.5 case 1 & 2 69.04762
# 6:   1.0.1 case 3 & 4 71.16564
# 7:   1.0.2 case 3 & 4 70.73171
# 8:   1.0.3 case 3 & 4 67.05882
# 9:   1.0.4 case 3 & 4 66.66667
#10:   1.0.5 case 3 & 4 69.04762

One additional piece of advice: I would recommend to not use spaces or special characters in column names. Although you can get away with it here by using backticks around the names, it can cause issues. Better to call the columns something like case_a_b

The logic needed to do cross row indexing is nicely illustrated here. — IRTFM, Oct 20 '16 at 19:32
Don't worry, Those columns are just place holders to make it more obvious whats going on there. — JohnN, Oct 21 '16 at 20:27

score 2 · Answer 2 · answered Oct 20 '16 at 18:43

Another solution using data.table with dcast:

library(data.table)
dt <- fread(" Version | Total | Case  
             1.0.1   | 110   | Case 1 
             1.0.2   | 111   | Case 1 
             1.0.3   | 114   | Case 1 
             1.0.4   | 114   | Case 1 
             1.0.5   | 113   | Case 1 
             1.0.1   |  53   | Case 2 
             1.0.2   |  53   | Case 2 
             1.0.3   |  56   | Case 2 
             1.0.4   |  57   | Case 2 
             1.0.5   |  55   | Case 2 
             1.0.1   | 110   | Case 3 
             1.0.2   | 111   | Case 3 
             1.0.3   | 113   | Case 3 
             1.0.4   | 114   | Case 3 
             1.0.5   | 113   | Case 3 
             1.0.1   |  52   | Case 4 
             1.0.2   |  53   | Case 4 
             1.0.3   |  56   | Case 4 
             1.0.4   |  57   | Case 4 
             1.0.5   |  55   | Case 4 ")

dcast(dt, Version ~ Case, value.var = "Total")[,
        .(Version, Case_1_2 = (`Case 1`-`Case 2`)/(.5*(`Case 1`+`Case 2`)),
          Case_3_4 = (`Case 3`-`Case 4`)/(.5*(`Case 3`+`Case 4`)))]

   Version  Case_1_2  Case_3_4
1:   1.0.1 0.6993865 0.7160494
2:   1.0.2 0.7073171 0.7073171
3:   1.0.3 0.6823529 0.6745562
4:   1.0.4 0.6666667 0.6666667
5:   1.0.5 0.6904762 0.6904762

To get the requested output it would be something like: `paste0( round( 100*Case_1_2, 0),"%")` or perhaps an `sprintf` call. — IRTFM, Oct 20 '16 at 19:31

Calculating percent difference from R data frame

2 Answers2