
I have a large data table that has a similar structure to this reproducible example:

library(data.table)

d <- data.table(
  g     =c(1  ,1  ,2  ,2  ,2  ,3  ,3  , 4 , 4),
  A     =c('a','a','b','b','w','c','q','r','r'),
  B     =c(1  ,1  ,2  ,2  ,2  ,3  ,3  ,4  , 5),
  sameA =c(T  , T , F , F , F , F , F , T , T),
  sameAB=c(T  , T , F , F , F , F , F , F , F)
 )

`g` indicates the group, and `A` and `B` are individual characteristics. I want to create the variables:

  • `sameA`: indicates that `A` is the same for all observations of the group
  • `sameAB`: indicates that `A` and `B` are the same for all observations of the group

How can I do this in an efficient way using data.table?

LucasMation
    This should get you going: [Counting unique / distinct values by group in a data frame](https://stackoverflow.com/questions/12840294/counting-unique-distinct-values-by-group-in-a-data-frame/12907956#12907956) – Henrik Jun 19 '20 at 17:00

3 Answers


Perhaps you could try this: use `uniqueN` to get the number of unique values per group (with `by = g`) and compare it with 1.

library(data.table)

d[, c("sameAnew", "saveABnew") := list(uniqueN(A) == 1, uniqueN(A) == 1 & uniqueN(B) == 1), by = g]

Or, as suggested by @jangorecki, create a temporary variable inside braces `{}` so that `uniqueN(A)` is computed only once:

d[, c("sameAnew", "saveABnew") := {tmp <- uniqueN(A); list(tmp == 1, tmp == 1 & uniqueN(B) == 1)}, by = g]

Output

   g A B sameA sameAB sameAnew sameABnew
1: 1 a 1  TRUE   TRUE     TRUE      TRUE
2: 1 a 1  TRUE   TRUE     TRUE      TRUE
3: 2 b 2 FALSE  FALSE    FALSE     FALSE
4: 2 b 2 FALSE  FALSE    FALSE     FALSE
5: 2 w 2 FALSE  FALSE    FALSE     FALSE
6: 3 c 3 FALSE  FALSE    FALSE     FALSE
7: 3 q 3 FALSE  FALSE    FALSE     FALSE
8: 4 r 4  TRUE  FALSE     TRUE     FALSE
9: 4 r 5  TRUE  FALSE     TRUE     FALSE
Ben
    you can avoid calling `uniqueN(A)` twice by `j = {tmp<-uniqueN(A); your_j_arg}` – jangorecki Jun 19 '20 at 20:49
  • @Ben, thanks. This seems to work fine but it is quite slow. I tested it on a 1% sample of the data and it still took quite a long time to compute these. – LucasMation Jun 20 '20 at 04:13
  • @LucasMation I see. Would definitely take a look at @chinsoon12's answer. If `uniqueN` is contributing to slowness (see [github issue](https://github.com/Rdatatable/data.table/issues/3739)) you could try `length(unique())` or `duplicated()` as @chinsoon12 describes. Please feel free to describe your dataset (overall size, and maybe average size of groups)... – Ben Jun 20 '20 at 14:23
  • the dataset is about 500m observations, 100m groups and 10 observations per group. I will test your suggestion and the suggestions below later. From the issue, we can see that this is the situation in which `uniqueN` performs poorly. – LucasMation Jun 20 '20 at 18:07
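
For reference, the `length(unique())` workaround mentioned in the comments might look like the following sketch (the column names `sameA4`/`sameAB4` are placeholders, not from the thread above):

d[, c("sameA4", "sameAB4") := {
  nA <- length(unique(A))  # base-R distinct count, sidestepping uniqueN
  list(nA == 1L, nA == 1L & length(unique(B)) == 1L)
}, by = g]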

Your explanation of what you want to happen isn't precise enough for me to know whether I'm giving you the answer you're looking for.

My interpretation is, "I want a data table with only the unique rows"

If that's the case, then the following should work.

d <- as.data.frame(d)
d <- data.table::data.table(unique.data.frame(d))
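
Note that `unique()` also has a data.table method, so the round trip through a data frame shouldn't be needed; a one-line sketch:

d <- unique(d)  # data.table's unique() drops duplicate rows directly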

Here is another option:

isdup <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)
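
For instance, on a plain vector the helper flags every entry that occurs more than once:

isdup(c(1, 2, 2, 3))   # FALSE TRUE TRUE FALSE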

# flag rows whose (g, A) combination occurs more than once, then mark a group
# as sameA2 when every one of its rows is flagged
# (note: a group where several distinct A values each repeat, e.g. a, a, b, b,
# would also be flagged TRUE)
d[, adup := isdup(.SD), .SDcols=c("g", "A")][,
    sameA2 := !any(!adup), g]

# check B only within groups where A already matches
d[, sameAB2 := FALSE][
    (sameA2), bdup := isdup(.SD), .SDcols=c("g", "B")][
        (sameA2), sameAB2 := !any(!bdup), g]
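
On the example data these should reproduce the expected columns; a quick sanity check (assuming `d` as defined in the question) might be:

d[, all(sameA2 == sameA) & all(sameAB2 == sameAB)]  # expect TRUE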

And an unorthodox version without `by`:

d[order(g, A, B), c("sameA3", "sameAB3") := {
    a <- rleid(g, A)              # run id that increments whenever (g, A) changes
    b <- rleid(g, B)              # run id that increments whenever (g, B) changes
    newg <- which(diff(g) > 0L)   # last row of each group (except the final group)

    nas <- rep(NA_integer_, .N)
    # number of (g, A) / (g, B) runs seen so far within the current group
    acnt <- a - nafill(nafill(replace(nas, newg + 1L, a[newg]), "locf"), fill=0L)
    bcnt <- b - nafill(nafill(replace(nas, newg + 1L, b[newg]), "locf"), fill=0L)

    # keep the counts only at each group's last row, then roll them backwards
    # so every row carries its group's total run count
    acnt[-c(newg, .N)] <- NA_integer_
    acnt <- nafill(acnt, "nocb")
    bcnt[-c(newg, .N)] <- NA_integer_
    bcnt <- nafill(bcnt, "nocb")

    l <- acnt == 1L               # a single A run in the sorted group => all A equal
    .(l, l & bcnt == 1L)
}]

Would you mind sharing the timings on your large dataset?
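
For anyone who wants to try, a minimal timing sketch with `system.time()` on simulated data (the sizes below are placeholders, far smaller than the dataset described in the comments) could be:

library(data.table)
set.seed(1)
n_groups <- 1e5  # stand-in for the much larger number of groups mentioned above
big <- data.table(g = rep(seq_len(n_groups), each = 5),
                  A = sample(letters, n_groups * 5, replace = TRUE),
                  B = sample(1:10,    n_groups * 5, replace = TRUE))

system.time(big[, sameA  := uniqueN(A) == 1L, by = g])         # uniqueN approach
system.time(big[, sameA2 := length(unique(A)) == 1L, by = g])  # length(unique()) approach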

chinsoon12