Find greater or smaller value by grouped variable using data.table

Question

My source data has data corresponding to a few months, but of those, I only want to compare data from pre-specified months.

Here's my input data:

dput(mydf)
structure(list(Month = structure(c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 
2L, 1L, 2L, 1L), .Label = c("Aug", "Oct", "Sep"), class = "factor"), 
    Pipe = c(3, 4, 5, 3, 2, 1, 3, 3, 4, NA, 5), Gp = structure(c(1L, 
    1L, 2L, 2L, 2L, 3L, 4L, 5L, 5L, 6L, 6L), .Label = c("A", 
    "B", "C", "D", "E", "F"), class = "factor")), .Names = c("Month", 
"Pipe", "Gp"), row.names = c(NA, -11L), class = "data.frame")

Now, of the three months, I only want to compare the months specified by below variables.

 This_month_to_compare <- "Oct"
  Last_Month_to_compare <- "Aug"

Now, for given two months above and based on groupings Gp, I want to indicate whether Pipe value in This_month_to_compare is greater than that in Last month to compare. If one of the two pipe values don't exist, we will leave it blank.

Here's what the output looks like (manually created because I wasn't successful with the code)

structure(list(Month = structure(c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 
2L, 1L, 2L, 1L), .Label = c("Aug", "Oct", "Sep"), class = "factor"), 
    Pipe = c(3, 4, 5, 3, 2, 1, 3, 3, 4, NA, 5), Gp = structure(c(1L, 
    1L, 2L, 2L, 2L, 3L, 4L, 5L, 5L, 6L, 6L), .Label = c("A", 
    "B", "C", "D", "E", "F"), class = "factor"), Greater = c(NA, 
    TRUE, NA, FALSE, NA, NA, NA, FALSE, NA, NA, NA)), .Names = c("Month", 
"Pipe", "Gp", "Greater"), row.names = c(NA, -11L), class = "data.frame")

Month   Pipe    Gp  Greater Explanation
Aug      3      A           Ignore: Aug
Oct      4      A   TRUE    4 > 3 
Aug      5      B           Ignore: Aug
Oct      3      B   FALSE   3< 5
Sep      2      B           Ignore: Sep
Aug      1      C           Ignore: Aug
Oct      3      D           There is nothing to compare with
Oct      3      E   FALSE   3<4
Aug      4      E           Ignore: Aug
Oct             F           Cannot compare NA with 5
Aug      5      F           Ignore: Aug

I have manually added explanation above.

I did try to code, and here's my attempt:

mydfi<-data.table::as.data.table(mydfi)
  mydf<-mydfi
  #Method 1: Convert to Wide Format
  #Convert to wide format
  mydf<-data.table::dcast(mydf,Gp ~ Month, value.var = "Pipe")
  #Compare
  mydf$Growth<-mydf[[This_month_to_compare]]>mydf[[Last_Month_to_compare]]
  #Back to long format
  Melt_columns<-c("Aug","Oct","Sep")
  mydf<-data.table::melt(mydf, measure.vars =Melt_columns,variable.name = "Month", value.name = "Pipe")
  mydfo<-mydf[mydfi,on=c("Month","Gp","Pipe")]
  mydfo[Month!=This_month_to_compare,"Growth"]<-NA

An update: I was able to solve above problem by just adding a left join. I have updated my code above. However, I am looking for a solution on these lines: Calculate difference between values in consecutive rows by group

The reason is that my actual data set is very big and doesn't permit joins.

Any help would be greatly appreciated. Thanks in advance.

score 1 · Answer 1 · answered Nov 01 '17 at 21:29

Is this what you are thinking of?

> library(data.table)
> mydf <- data.table(mydf)
> This_month_to_compare <- "Oct"
> Last_Month_to_compare <- "Aug"
> setkey(mydf, Gp, Month)
> 
> # Make dummy table to join with
> mydf[
+   , Pipe_this := .SD[Month == This_month_to_compare, Pipe], by = "Gp"][
+     , Pipe_last := .SD[Month == Last_Month_to_compare, Pipe], by = "Gp"][
+       , `:=`(
+         Greater = Pipe_last < Pipe_this, Pipe_last = NULL, Pipe_this = NULL)][
+           Month != "Oct", Greater := NA]
> mydf
    Month Pipe Gp Greater
 1:   Aug    3  A      NA
 2:   Oct    4  A    TRUE
 3:   Aug    5  B      NA
 4:   Oct    3  B   FALSE
 5:   Sep    2  B      NA
 6:   Aug    1  C      NA
 7:   Oct    3  D      NA
 8:   Aug    4  E      NA
 9:   Oct    3  E   FALSE
10:   Aug    5  F      NA
11:   Oct   NA  F      NA

You can simplify the code to avoid two of the [.data.table calls from the above if you want and avoid defining the Pipe_this and Pipe_last.

Uwe · Accepted Answer · 2017-11-06T15:11:52.767

This can be achieved by two joins. The first one filters out the months to compare and orders them as required. Then the comparisions can be done. The second join appends the result to the original data frame.

library(data.table)
# Last_Month_to_compare, This_month_to_compare
months_to_compare <- c("Aug", "Oct")
mDT <- setDT(mydf)[
  # append row id column (to preserve original order)
  , rn := .I][
    # cross join of groups and months
    CJ(Gp = Gp, Month = months_to_compare, unique = TRUE), on = .(Gp, Month)][
      # groupwise comparison of the two months
      , Greater := Pipe > shift(Pipe), by = Gp][]
# appending result to original data frame by joining with intermediate result
mydf[mDT, on = .(rn), Greater := i.Greater][]

    Month Pipe Gp rn Greater
 1:   Aug    3  A  1      NA
 2:   Oct    4  A  2    TRUE
 3:   Aug    5  B  3      NA
 4:   Oct    3  B  4   FALSE
 5:   Sep    2  B  5      NA
 6:   Aug    1  C  6      NA
 7:   Oct    3  D  7      NA
 8:   Oct    3  E  8   FALSE
 9:   Aug    4  E  9      NA
10:   Oct   NA  F 10      NA
11:   Aug    5  F 11      NA

Note that the original order of mydf is preserved.

The intermediate result mDT looks like

    Month Pipe Gp rn Greater
 1:   Aug    3  A  1      NA
 2:   Oct    4  A  2    TRUE
 3:   Aug    5  B  3      NA
 4:   Oct    3  B  4   FALSE
 5:   Aug    1  C  6      NA
 6:   Oct   NA  C NA      NA
 7:   Aug   NA  D NA      NA
 8:   Oct    3  D  7      NA
 9:   Aug    4  E  9      NA
10:   Oct    3  E  8   FALSE
11:   Aug    5  F 11      NA
12:   Oct   NA  F 10      NA

Edit: Additional explanation

The OP has asked for an explanation of the difference between mydf[mDT, on = .(rn)] and mydf[mDT, on = .(rn), Greater := i.Greater][].

With data.table, X[Y, on = ...] is a right outer join which is equivalent to merge(X, Y, all.y = TRUE), i.e., it returns all rows of Y (see JOINing data in R using data.table). So,

mydf[mDT, on = .(rn)]

returns

    Month Pipe Gp rn i.Month i.Pipe i.Gp Greater
 1:   Aug    3  A  1     Aug      3    A      NA
 2:   Oct    4  A  2     Oct      4    A    TRUE
 3:   Aug    5  B  3     Aug      5    B      NA
 4:   Oct    3  B  4     Oct      3    B   FALSE
 5:   Aug    1  C  6     Aug      1    C      NA
 6:    NA   NA NA NA     Oct     NA    C      NA
 7:    NA   NA NA NA     Aug     NA    D      NA
 8:   Oct    3  D  7     Oct      3    D      NA
 9:   Aug    4  E  9     Aug      4    E      NA
10:   Oct    3  E  8     Oct      3    E   FALSE
11:   Aug    5  F 11     Aug      5    F      NA
12:   Oct   NA  F 10     Oct     NA    F      NA

Columns prefixed by i. come from mDT. Note that rows 6 and 7 do not have matching rows in mydf. Also, the order of rows is determined by the order in mDT.

If mydf and mDT are interchanged,

mDT[mydf, on = .(rn)][]

returns

    Month Pipe Gp rn Greater i.Month i.Pipe i.Gp
 1:   Aug    3  A  1      NA     Aug      3    A
 2:   Oct    4  A  2    TRUE     Oct      4    A
 3:   Aug    5  B  3      NA     Aug      5    B
 4:   Oct    3  B  4   FALSE     Oct      3    B
 5:    NA   NA NA  5      NA     Sep      2    B
 6:   Aug    1  C  6      NA     Aug      1    C
 7:   Oct    3  D  7      NA     Oct      3    D
 8:   Oct    3  E  8   FALSE     Oct      3    E
 9:   Aug    4  E  9      NA     Aug      4    E
10:   Oct   NA  F 10      NA     Oct     NA    F
11:   Aug    5  F 11      NA     Aug      5    F

Columns prefixed by i. come from mydf, now. Note that row 5 has no match in mDT. Also, the order of rows is determined by mydf.

With the assignment operator :=, X[Y, on = ..., a := b] becomes a left inner join which includes all rows of X in its original order. Thus,

mydf[mDT, on = .(rn), Greater := i.Greater][]

returns

    Month Pipe Gp rn Greater
 1:   Aug    3  A  1      NA
 2:   Oct    4  A  2    TRUE
 3:   Aug    5  B  3      NA
 4:   Oct    3  B  4   FALSE
 5:   Sep    2  B  5      NA
 6:   Aug    1  C  6      NA
 7:   Oct    3  D  7      NA
 8:   Oct    3  E  8   FALSE
 9:   Aug    4  E  9      NA
10:   Oct   NA  F 10      NA
11:   Aug    5  F 11      NA

where Greater becomes NA for non-matching rows.

@Uwe- Thanks for your help. Could you please help me understand the difference between `mydf[mDT, on = .(rn)][]` and `mydf[mDT, on = .(rn), Greater := i.Greater][]`. I understood what's going on with the former. It's a left join on `rn`, but I am unsure what `Greater := i.Greater` does. I'd appreciate your help. — watchtower, Nov 04 '17 at 06:33

Find greater or smaller value by grouped variable using data.table

2 Answers2

Edit: Additional explanation