
So I need to calculate the percent difference for all combinations of values in column y. For example, the difference between B 1 and B 2, then between B 1 and B 3, and so on for all combinations of B; then the same thing for all combinations of D.

Here is some example data...

structure(list(Levelname = c("B 1", "B 2", "B 3", 
"B 4", "D 1", "D 2", "D 3", "D 4"), y = c(0.679428655093332, 
1.07554328679719, 0.883000346050764, 0.791772867506205, 0.538143790501689, 
0.805122127560562, 0.591353204313314, 0.795225886492002), fill = c("midnightblue", 
"dodgerblue4", "steelblue3", "lightskyblue", "midnightblue", 
"dodgerblue4", "steelblue3", "lightskyblue"), species = c("White Grunt", 
"White Grunt", "White Grunt", "White Grunt", "White Grunt", "White Grunt", 
"White Grunt", "White Grunt")), row.names = c(NA, -8L), class = "data.frame")

My ideal output would be a data frame with some sort of pair identifier, like

Pair           Percent Difference
B 1 - B 2      45.142
B 1 - B 3      .....
B 1 - B 4      .....
B 2 - B 3      .....
B 2 - B 4      .....
B 3 - B 4      .....
D 1 - D 2      .....
D 1 - D 3      .....
D 1 - D 4      .....
D 2 - D 3      .....
D 2 - D 4      .....
D 3 - D 4      .....
where ..... are the percent differences

I don't care about the differences between B and D. Also, I'm trying to get better at functions, for loops, and the apply functions of R, so answers using those (or a variety of them) would be great.

I tried to look at these answers but I couldn't figure it out:

Loops in R - Need to use index, anyway to avoid 'for'?
How can I calculate the percentage change within a group for multiple columns in R?

I calculated the 45.142 like this:

    |B1 − B2| / [(B1 + B2)/2] × 100
    = |0.67942865509333 − 1.0755433| / [(0.67942865509333 + 1.0755433)/2] × 100
    = |−0.39611464490667| / [1.7549719550933/2] × 100
    = 0.39611464490667 / 0.87748597754667 × 100
    = 0.45142 × 100
    = 45.142% difference
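For reference, the same calculation as a small R function (a hedged sketch; `pct_diff` is just an illustrative name, not from the question):

```r
# Percent difference: absolute difference divided by the mean of the
# two values, times 100 (the formula worked out above).
pct_diff <- function(a, b) abs(a - b) / ((a + b) / 2) * 100

round(pct_diff(0.679428655093332, 1.07554328679719), 3)
# 45.142
```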
jay.sf
Johnny5ish
  • How did you get the 45.142? – Onyambu Sep 19 '21 at 21:35
  • @Onyambu I added the calculation to the question. – Johnny5ish Sep 19 '21 at 21:48
  • @akrun No, I have a set of all B and D outputs for each species in separate data frames. – Johnny5ish Sep 19 '21 at 21:48
  • @akrun Insulting to me? I didn't think so; I was just replying to you. I'm always trying to ask better questions. Now that everyone knows what I was looking for and the question has been edited, is there anything I could do to be clearer? I'm thinking that if I can write better questions, other people will find them useful and upvote more often. Thanks! – Johnny5ish Sep 19 '21 at 22:12
  • When you posted the question, the dput was kind of messy, and then the data was changed multiple times. Earlier it was `B High` etc.; then I found the output was different and you had a different function, `percent` or something. Anyway, it is easier to post a solution once everything is fixed and clear, as with the calculation that was edited in. – akrun Sep 19 '21 at 22:16
  • You guys are so fast, which I really appreciate, but in this case you were working through solutions while I was still fixing errors. I'll try to be more vigilant next time. I do think there should be comments associated with up and down votes; now I have no idea why someone downvoted this. How is that helpful? – Johnny5ish Sep 20 '21 at 01:55
  • @Johnny5ish There may be many reasons for downvotes; just don't worry about it in this case. You didn't show much effort of your own, but I've seen worse questions that didn't get downvoted. BTW, see the [benchmark](https://stackoverflow.com/a/69247303/6574038) I added to my answer; that might interest you. – jay.sf Sep 20 '21 at 06:32
  • @jay.sf Thanks for the advice. That is much faster. – Johnny5ish Sep 24 '21 at 03:19

2 Answers


Using tidyverse:

library(tidyverse)

df %>%
  group_by(grp = str_extract(Levelname, "\\w+"))%>%
  summarise(pair = combn(Levelname, 2, str_c, collapse = " - "),
            perc_diff = combn(y, 2, function(x) 200*abs(diff(x))/sum(x)),
            .groups = 'drop')



A tibble: 12 x 3
   grp   pair      perc_diff
   <chr> <chr>         <dbl>
 1 B     B 1 - B 2     45.1 
 2 B     B 1 - B 3     26.1 
 3 B     B 1 - B 4     15.3 
 4 B     B 2 - B 3     19.7 
 5 B     B 2 - B 4     30.4 
 6 B     B 3 - B 4     10.9 
 7 D     D 1 - D 2     39.8 
 8 D     D 1 - D 3      9.42
 9 D     D 1 - D 4     38.6 
10 D     D 2 - D 3     30.6 
11 D     D 2 - D 4      1.24
12 D     D 3 - D 4     29.4 
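The pairing here is driven by `combn()`, which enumerates all 2-element combinations of a vector and can apply a function to each pair; a quick standalone illustration of both uses:

```r
# Each unordered pair of labels, collapsed into an identifier string
combn(c("B 1", "B 2", "B 3"), 2, paste, collapse = " - ")
# "B 1 - B 2" "B 1 - B 3" "B 2 - B 3"

# The same pairing over numbers, applying the percent-difference formula
combn(c(1, 4, 9), 2, function(x) 200 * abs(diff(x)) / sum(x))
# 120.00000 160.00000  76.92308
```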
Onyambu
  • Thank you, this is great. I modified the calculation slightly: ```grunt.diff <- grunt.nums %>% group_by(grp = str_extract(Levelname, "\\w+"))%>% summarise(pair = combn(Levelname, 2, str_c, collapse = " - "), perc_diff = combn(y, 2, function(x) abs(diff(x))/(sum(x)/2))*100, .groups = 'drop')``` – Johnny5ish Sep 19 '21 at 22:05

We can use outer twice, once on the y values for the calculation and once on the Levelnames with paste, keeping just the upper.tri entries in each case.

f <- \(x, y) abs(x - y)*100 / ((x + y) / 2)  ## your p_diff formula

p_diff <- outer(dt$y, dt$y, f) |>
  {\(x) abs(x[upper.tri(x)])}() |>
  round(3)

Pair <- outer(dt$Levelname, dt$Levelname, paste, sep=' - ')|>
  {\(x) x[upper.tri(x)]}()

res <- data.frame(Pair, p_diff)

Result

res
#         Pair p_diff
# 1  B 1 - B 2 45.142
# 2  B 1 - B 3 26.058
# 3  B 2 - B 3 19.662
# 4  B 1 - B 4 15.272
# 5  B 2 - B 4 30.393
# 6  B 3 - B 4 10.894
# 7  B 1 - D 1 23.208
# 8  B 2 - D 1 66.605
# 9  B 3 - D 1 48.532
# 10 B 4 - D 1 38.142
# 11 B 1 - D 2 16.934
# 12 B 2 - D 2 28.758
# 13 B 3 - D 2  9.227
# 14 B 4 - D 2  1.672
# 15 D 1 - D 2 39.751
# 16 B 1 - D 3 13.862
# 17 B 2 - D 3 58.095
# 18 B 3 - D 3 39.563
# 19 B 4 - D 3 28.981
# 20 D 1 - D 3  9.422
# 21 D 2 - D 3 30.615
# 22 B 1 - D 4 15.705
# 23 B 2 - D 4 29.968
# 24 B 3 - D 4 10.460
# 25 B 4 - D 4  0.435
# 26 D 1 - D 4 38.561
# 27 D 2 - D 4  1.237
# 28 D 3 - D 4 29.407

Benchmark

I doubted that the tidy approach was faster, and I was right. Here is a benchmark comparing the solutions so far; the outer approach is almost 20 times faster.

f1 <- \() data.frame(p_diff=outer(dt$y, dt$y, f) |>
                       {\(x) abs(x[upper.tri(x)])}() |>
                       round(3), 
                     Pair=outer(dt$Levelname, dt$Levelname, paste, sep=' - ')|>
                       {\(x) x[upper.tri(x)]}())
library(dplyr);library(stringr)
f2 <- \() dt %>%
  group_by(grp = str_extract(Levelname, "\\w+"))%>%
  summarise(pair = combn(Levelname, 2, str_c, collapse = " - "),
            perc_diff = combn(y, 2, function(x) 200*abs(diff(x))/sum(x)),
            .groups = 'drop')
dt <- dt[sample(nrow(dt), 1e3, replace=T), ]

microbenchmark::microbenchmark(outer=f1(), tidyverse=f2(), times=3L)
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval cld
#     outer  236.7207  243.0496  306.1265  249.3785  340.8294  432.2804     3  a 
# tidyverse 4819.3476 4830.7364 4838.5051 4842.1251 4848.0839 4854.0427     3   b
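Since the question also asked about for loops and the apply family, here is a hedged base-R sketch of the within-group version (variable names are illustrative; the data frame is rebuilt from the question's dput, with the fill/species columns omitted):

```r
# Base-R version: a plain for loop over the groups, with combn() and
# apply() doing the within-group pairing.
dt <- data.frame(
  Levelname = c("B 1", "B 2", "B 3", "B 4", "D 1", "D 2", "D 3", "D 4"),
  y = c(0.679428655093332, 1.07554328679719, 0.883000346050764,
        0.791772867506205, 0.538143790501689, 0.805122127560562,
        0.591353204313314, 0.795225886492002)
)

f <- function(x, y) abs(x - y) * 100 / ((x + y) / 2)  # percent difference

grp <- sub(" .*", "", dt$Levelname)        # group letter: "B" or "D"
res <- list()
for (g in unique(grp)) {                   # loop over the groups
  idx <- combn(which(grp == g), 2)         # 2 x nC2 matrix of index pairs
  res[[g]] <- data.frame(
    Pair   = apply(idx, 2, function(j) paste(dt$Levelname[j], collapse = " - ")),
    p_diff = round(apply(idx, 2, function(j) f(dt$y[j[1]], dt$y[j[2]])), 3)
  )
}
res <- do.call(rbind, res)                 # 12 rows: the 6 B pairs, then the 6 D pairs
```

Unlike the outer approach, this never forms the cross-group B–D pairs, so no filtering step is needed.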
jay.sf