0

I'm trying to figure out how to do some basic math with a data frame.

I have a data frame that looks like this:

| Version | Total | Case   |
|---------|-------|--------|
| 1.0.1   | 110   | Case 1 |
| 1.0.2   | 111   | Case 1 |
| 1.0.3   | 114   | Case 1 |
| 1.0.4   | 114   | Case 1 |
| 1.0.5   | 113   | Case 1 |
| 1.0.1   |  53   | Case 2 |
| 1.0.2   |  53   | Case 2 |
| 1.0.3   |  56   | Case 2 |
| 1.0.4   |  57   | Case 2 |
| 1.0.5   |  55   | Case 2 |
| 1.0.1   | 110   | Case 3 |
| 1.0.2   | 111   | Case 3 |
| 1.0.3   | 113   | Case 3 |
| 1.0.4   | 114   | Case 3 |
| 1.0.5   | 113   | Case 3 |
| 1.0.1   |  52   | Case 4 |
| 1.0.2   |  53   | Case 4 |
| 1.0.3   |  56   | Case 4 |
| 1.0.4   |  57   | Case 4 |
| 1.0.5   |  55   | Case 4 |

I want to calculate the "percent different" between 'Case 1 and 2' and then also 'Case 3 and 4' for each Version. So for 1.0.1 it would do this math: (110-53)/(.5*(110+53))

Ultimately it would end up with a table that looked like this:

| Version | Total | Case       |
|---------|-------|------------|
| 1.0.1   | 70%   | Case 1 & 2 |
| 1.0.2   | 71%   | Case 1 & 2 |
| 1.0.3   | 68%   | Case 1 & 2 |
| 1.0.4   | 67%   | Case 1 & 2 |
| 1.0.5   | 69%   | Case 1 & 2 |
| 1.0.1   | 72%   | Case 3 & 4 |
| 1.0.2   | 71%   | Case 3 & 4 |
| 1.0.3   | 67%   | Case 3 & 4 |
| 1.0.4   | 67%   | Case 3 & 4 |
| 1.0.5   | 69%   | Case 3 & 4 |

EDIT: Here's a working example of the first table to use.

Version <- c('1.0.1', '1.0.2', '1.0.3', '1.0.4', '1.0.5', '1.0.1', '1.0.2', '1.0.3', '1.0.4', '1.0.5', '1.0.1', '1.0.2', '1.0.3', '1.0.4', '1.0.5', '1.0.1', '1.0.2', '1.0.3', '1.0.4', '1.0.5')
Total <- c(110, 111, 114, 114, 113, 53, 53, 56, 57, 55, 110, 111, 113, 114, 113, 52, 53, 56, 57, 55)
Case <- c('Case 1', 'Case 1', 'Case 1', 'Case 1', 'Case 1', 'Case 2', 'Case 2', 'Case 2', 'Case 2', 'Case 2', 'Case 3', 'Case 3', 'Case 3', 'Case 3', 'Case 3', 'Case 4', 'Case 4', 'Case 4', 'Case 4', 'Case 4')
df <- data.frame(Version, Total, Case)
dww
  • 30,425
  • 5
  • 68
  • 111
JohnN
  • 968
  • 4
  • 13
  • 35
  • This might seriously help you http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega. The `apply` family of functions are essential to using R properly. – Eli Sadoff Oct 20 '16 at 18:19
  • I think melt and cast would be helpful. I'm happy to craft an answer but please add some sample data in a reproducible way so that posters have access to it (without manually entering the text data). http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example will help you! – Joy Oct 20 '16 at 18:30
  • @Joy See my edit. – JohnN Oct 20 '16 at 18:45
  • I deleted the phrase "I'm a bit of an R newbie." from the Q. As a general rule, this type of language is discouraged in SO because it distracts from the question. Moreover, as you can see from the answers the solution is not trivial in any case. – dww Oct 20 '16 at 19:39

2 Answers2

3

You can use library (data.table)

setDT(df)
ans = df[, .(`case 1 & 2` = 200*(.SD[Case=="Case 1", Total] - .SD[Case=="Case 2", Total]) / (.SD[Case=="Case 1", Total] + .SD[Case=="Case 2", Total]),
             `case 3 & 4` = 200*(.SD[Case=="Case 3", Total] - .SD[Case=="Case 4", Total]) / (.SD[Case=="Case 1", Total] + .SD[Case=="Case 2", Total])
       ), by=Version]
#    Version case 1 & 2 case 3 & 4
# 1:   1.0.1   69.93865   71.16564
# 2:   1.0.2   70.73171   70.73171
# 3:   1.0.3   68.23529   67.05882
# 4:   1.0.4   66.66667   66.66667
# 5:   1.0.5   69.04762   69.04762

If you need this in long format you can use melt

melt(ans, id="Version")
#    Version   variable    value
# 1:   1.0.1 case 1 & 2 69.93865
# 2:   1.0.2 case 1 & 2 70.73171
# 3:   1.0.3 case 1 & 2 68.23529
# 4:   1.0.4 case 1 & 2 66.66667
# 5:   1.0.5 case 1 & 2 69.04762
# 6:   1.0.1 case 3 & 4 71.16564
# 7:   1.0.2 case 3 & 4 70.73171
# 8:   1.0.3 case 3 & 4 67.05882
# 9:   1.0.4 case 3 & 4 66.66667
#10:   1.0.5 case 3 & 4 69.04762

One additional piece of advice: I would recommend to not use spaces or special characters in column names. Although you can get away with it here by using backticks around the names, it can cause issues. Better to call the columns something like case_a_b

dww
  • 30,425
  • 5
  • 68
  • 111
2

Another solution using data.table with dcast:

library(data.table)
dt <- fread(" Version | Total | Case  
             1.0.1   | 110   | Case 1 
             1.0.2   | 111   | Case 1 
             1.0.3   | 114   | Case 1 
             1.0.4   | 114   | Case 1 
             1.0.5   | 113   | Case 1 
             1.0.1   |  53   | Case 2 
             1.0.2   |  53   | Case 2 
             1.0.3   |  56   | Case 2 
             1.0.4   |  57   | Case 2 
             1.0.5   |  55   | Case 2 
             1.0.1   | 110   | Case 3 
             1.0.2   | 111   | Case 3 
             1.0.3   | 113   | Case 3 
             1.0.4   | 114   | Case 3 
             1.0.5   | 113   | Case 3 
             1.0.1   |  52   | Case 4 
             1.0.2   |  53   | Case 4 
             1.0.3   |  56   | Case 4 
             1.0.4   |  57   | Case 4 
             1.0.5   |  55   | Case 4 ")

dcast(dt, Version ~ Case, value.var = "Total")[,
        .(Version, Case_1_2 = (`Case 1`-`Case 2`)/(.5*(`Case 1`+`Case 2`)),
          Case_3_4 = (`Case 3`-`Case 4`)/(.5*(`Case 3`+`Case 4`)))]

   Version  Case_1_2  Case_3_4
1:   1.0.1 0.6993865 0.7160494
2:   1.0.2 0.7073171 0.7073171
3:   1.0.3 0.6823529 0.6745562
4:   1.0.4 0.6666667 0.6666667
5:   1.0.5 0.6904762 0.6904762
HubertL
  • 19,246
  • 3
  • 32
  • 51
  • To get the requested output it would be something like: `paste0( round( 100*Case_1_2, 0),"%")` or perhaps an `sprintf` call. – IRTFM Oct 20 '16 at 19:31