0

Similar to this question here, I am trying to find the difference between the maximum value of a group and the value of the current row.

For instance, if I have the following dataset:

ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)

How would I go about creating a new column called "diff" that would be the difference between the value of the current row and the maximum value in that group?

Thank you for your help!

ModalBro
  • 544
  • 5
  • 25
  • 2
    `group %>% group_by(ID) %>% mutate(diff=max(Value)-Value)`? Did you try anything? If so, please advise so we can see what didn't work for you. – r2evans Jul 28 '18 at 06:22
  • I was trying with the data.table approach that I linked, but didn't have any success. – ModalBro Jul 28 '18 at 06:28
  • 2
    Ok ... it often helps (and is more relevant/absorbable by you) if we see your code and make corrections to your attempts. We can't correct a small mistake in code if you don't include the code and the error or incorrect output. – r2evans Jul 28 '18 at 06:31

3 Answers3

6

The OP has tried a data.table solution. Here, we benefit from grouping and updating by reference simultaneously.

library(data.table)
setDT(group)[, diff := max(pt) - pt, by = Subject][]
   Subject pt diff
1:       1  2    3
2:       1  3    2
3:       1  5    0
4:       2  2   15
5:       2  5   12
6:       2  8    9
7:       2 17    0
8:       3  3    2
9:       3  5    0

Data

ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)

Benchmark

At the time of writing, 5 answers were posted, including Frank's comment on the efficiency of th data.table approach. So, I was wondering which of the five solutions were the fastest.

  1. r2evans
  2. mine
  3. Frank
  4. harelhan
  5. JonMinton

Some solutions modify the data.frame in place. To ensure a fair comparison, In addition,

The OP has required to create a new column called "diff". For comparison, all results should return group with three columns. Some answers were modified accordingly. The answer of harelhan required substantial modifications to remove the errors.

As group is modified, all benchmark runs start with a fresh copy of group with two columns.

The benchmark is parameterized over the number of rows and the share of groups, i.e., the number of groups varies with the problem size in order to scale.

library(data.table)
library(dplyr)
library(bench)
bm <- press(
  # n_row = c(1E2, 1E4, 1E5, 1E6),
  n_row = c(1E2, 1E4, 1E5),
  grp_share = c(0.01, 0.1, 0.5, 0.9),
  {
    n_grp <- grp_share * n_row
    set.seed(1)
    group0 <- data.frame(
      Subject = sample(n_grp, n_row, TRUE),
      pt = as.numeric(rpois(n_row, 100)))
    mark(
      r2Evans = {
        group <- copy(group0)
        group <- group %>% 
          group_by(Subject) %>% 
          mutate(diff = max(pt) - pt)
        group
      },
      Uwe = {
        group <- copy(group0)
        setDT(group)[, diff := max(pt) - pt, by = Subject]
        group
      },
      Frank = {
        group <- copy(group0)
        setDT(group)[, mx := max(pt), by=Subject][, diff := mx - pt][, mx := NULL]
        group
      },
      harelhan = {
        group <- copy(group0)
        max_group <- group %>% group_by(Subject) %>% summarize(max_val = max(pt))
        group <- left_join(group, max_group[, c("Subject", "max_val")], by = "Subject")
        group$diff <- group$max_val - group$pt
        group <- group %>% select(-max_val)
        group
      },
      JonMinton = {
        group <- copy(group0)
        group <- group %>% 
          group_by(Subject) %>% 
          mutate(max_group_val = max(pt)) %>% 
          ungroup() %>% 
          mutate(diff = max_group_val - pt) %>% 
          select(-max_group_val)
        group
      }
    )
  }
)
ggplot2::autoplot(bm)

enter image description here

Uwe
  • 41,420
  • 11
  • 90
  • 134
  • For efficiency, could do this in two stages and take advantage of gmax after https://github.com/Rdatatable/data.table/issues/1414 is done. – Frank Jul 28 '18 at 16:10
1

Using your example data and breaking the logic into smaller steps:

library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)
max_group <- group %>% group_by(ID) %>% summarize(max_val = max(Value))
group <- left_join(group, max_group[,c(ID, max_val)], by = ID)
group$diff <- group$max_val - group$Value

Hope this solves the problem.

pyr_viz
  • 47
  • 7
  • When I ran your code it says. *Error in grouped_df_impl(data, unname(vars), drop) : Column `ID` is unknown*. Furthermore, please show the result in order to verify and compare. Thank you. – Uwe Jul 30 '18 at 06:09
1

Based on harelhan's answer, but with piping:

require(dplyr)

df <- data_frame(
  id = c(1,1,1,2,2,2,2,3,3),
  value = c(2,3,5,2,5,8,17,3,5)
)
df %>% 
  group_by(id) %>% 
  mutate(max_group_val = max(value)) %>% 
  ungroup() %>% 
  mutate(diff_frm_group_max = max_group_val - value)

 A tibble: 9 x 4
      id value max_group_val diff_frm_group_max
   <dbl> <dbl>         <dbl>              <dbl>
 1     1     2             5                  3
 2     1     3             5                  2
 3     1     5             5                  0
 4     2     2            17                 15
 5     2     5            17                 12
 6     2     8            17                  9
 7     2    17            17                  0
 8     3     3             5                  2
 9     3     5             5                  0
JonMinton
  • 1,239
  • 2
  • 8
  • 26