3

I have this dataset:

library(data.table)    
dt <- data.table(
  record=c(1:20),
  area=rep(LETTERS[1:4], c(4, 6, 3, 7)), 
  score=c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
  cluster=c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)

and I have used the solution from this post to create this summary:

dt_summary = 
  dt[ , .N, keyby = .(area, score, cluster)
      ][ , {
        idx = frank(-N, ties.method = 'min') == 1
        NN = sum(N)
        .(
          cluster_mode = cluster[idx],
          cluster_pct = 100*N[idx]/NN,
          cluster_freq = N[idx],
          record_freq = NN
        )
      }, by = .(area, score)]
dt_score_1 <- dt_summary[score == 1]
setnames(dt_score_1, "area", "zone")

I would like to use the results from dt_score_1 to filter dt based on the area/zone and cluster/cluster_mode. So in a new data.table, the only rows taken from dt for area A should belong to cluster X, for area D they should be cluster Z etc.

Chris
  • 1,197
  • 9
  • 28
  • 1
    Maybe related: https://stackoverflow.com/questions/18969420/perform-a-semi-join-with-data-table – Frank Oct 15 '18 at 14:22

1 Answers1

5

If I'm understanding the question correctly, this is a merge of dt with dt_score_1 with the conditions area = zone, cluster = cluster_mode.

dt[dt_score_1, on = .(area = zone, cluster = cluster_mode)]

#     record area score cluster i.score cluster_pct cluster_freq record_freq
#  1:      1    A     1       X       1   100.00000            2           2
#  2:      2    A     1       X       1   100.00000            2           2
#  3:      7    B     1       X       1    66.66667            2           3
#  4:      8    B     1       X       1    66.66667            2           3
#  5:     11    C     2       X       1   100.00000            1           1
#  6:     12    C     1       X       1   100.00000            1           1
#  7:     14    D     1       Z       1    80.00000            4           5
#  8:     15    D     1       Z       1    80.00000            4           5
#  9:     16    D     1       Z       1    80.00000            4           5
# 10:     17    D     1       Z       1    80.00000            4           5
# 11:     20    D     3       Z       1    80.00000            4           5

For a more detailed explanation of join-as-filter, see the link below posted by @Frank

Perform a semi-join with data.table

IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
  • For reference, this type of problem is called a "join". The above solution is a right join of dt to dt_score_1 using area = zone and cluster = cluster_mode as the join keys. – Scott Ritchie Oct 15 '18 at 14:24
  • Thanks. In my mind it's a filter/subset, because what I actually need is something like this: dt[dt_score_1[,c("zone","cluster_mode")], on = .(area = zone, cluster = cluster_mode)] – Chris Oct 15 '18 at 14:29
  • 1
    Yeah, I'd guess from the syntax similarities that the `data.table` authors also see a right-join as similar to a subset. – IceCreamToucan Oct 15 '18 at 14:30
  • 1
    If you just want the columns from `dt`, I've found the cleanest solution is something like this: `dt[dt_score_1, on = .(area = zone, cluster = cluster_mode), .SD, .SDcols = names(dt)]` which means you don't need to type out the column names twice. – Scott Ritchie Oct 15 '18 at 14:33
  • 2
    Small discrepancies of x[i] vs filter (highlighted in hadley's semi join question that I linked): every row of i gets a row in the result, so unmatched rows from i leave rows with NAs for non-join columns (unless nomatch=0); and duplicate rows from i show up separately. Also, you get columns of i that you might not want. – Frank Oct 15 '18 at 14:36
  • Worth noting you can add `nomatch=0` to the data.table call to filter out rows in `i` without matches. That being said, a subset operation will also leave NAs where the item isn't found, e.g. `vec <- c(a=1, b=2, c=3); vec[c("b", "d")]`. – Scott Ritchie Oct 15 '18 at 14:51