
I have a data frame that looks like this:

df <- data.frame(
  Logical = c(TRUE,FALSE,FALSE,FALSE,FALSE,FALSE),
  A = c(1,2,3,2,3,1),
  B = c(1,0.05,0.80,0.05,0.80,1),
  C = c(1,10.80,15,10.80,15,1))

Which looks like:

  Logical A    B    C
1    TRUE 1 1.00  1.0
2   FALSE 2 0.05 10.8
3   FALSE 3 0.80 15.0
4   FALSE 2 0.05 10.8
5   FALSE 3 0.80 15.0
6   FALSE 1 1.00  1.0

I want to add a new variable, D, an integer assigned by the following rules: 0 if df$Logical is TRUE; otherwise an integer, starting at 1, that is shared by all rows whose values of A, B and C are approximately equal (approximately because they are doubles, so the comparison should allow a floating-point margin of error).

The expected output is:

  Logical A    B    C D
1    TRUE 1 1.00  1.0 0
2   FALSE 2 0.05 10.8 1
3   FALSE 3 0.80 15.0 2
4   FALSE 2 0.05 10.8 1
5   FALSE 3 0.80 15.0 2
6   FALSE 1 1.00  1.0 3

The first row gets 0 because Logical is TRUE. The second and fourth rows get 1 because A, B and C are approximately equal there, and likewise the third and fifth rows get 2. Row six gets 3 because it is the next unique row. Note that the order of the integers assigned in D is irrelevant except for the 0; e.g., rows 2 and 4 could just as well be assigned 2, as long as that integer is not reused for any other group.


I have considered using aggregating functions, for example dlply from plyr:

library("plyr")
df$foo <- 1:nrow(df)
foo <- dlply(df,.(A,B,C),'[[',"foo")
df$D <- 0
for (i in 1:length(foo)) df$D[foo[[i]]] <- i
df$D[df$Logical] <- 0

This works, but I am not sure how well it will cope with floating-point errors (I guess I could round the values beforehand and it should be quite stable; a sketch of that rounding idea follows after the loop below). With a loop it is quite easy:

df$D <- 0
cnt <- 1  # next group id ('c' in my first attempt, but that shadows base::c)
for (i in 1:nrow(df)) {
  if (!isTRUE(df$Logical[i]) && df$D[i] == 0) {
    # find all rows with Logical FALSE whose (A, B, C) values are
    # approximately equal to row i's, within all.equal's tolerance
    par <- sapply(1:nrow(df), function(j) {
      !df$Logical[j] &&
        isTRUE(all.equal(unlist(df[j, c("A", "B", "C")]),
                         unlist(df[i, c("A", "B", "C")])))
    })
    df$D[par] <- cnt
    cnt <- cnt + 1
  }
}

but this is very slow for larger data frames.
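For reference, a minimal sketch of the rounding idea mentioned above, in base R; the choice of signif() with 8 significant digits is my assumption, not something fixed by the problem:

df$D <- 0
# one grouping key per row built from the rounded values; drop = TRUE
# discards combinations of levels that never occur together
key <- interaction(signif(df$A, 8), signif(df$B, 8), signif(df$C, 8),
                   drop = TRUE)
# integer ids 1..k for the rows where Logical is FALSE
df$D[!df$Logical] <- as.integer(factor(key[!df$Logical]))

The particular integers can differ from the loop's, but as noted above only their uniqueness matters.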

  • Could you convert columns `A`, `B` and `C` to factors? With the sample dataset, that looks like it would be OK (wrt tolerance issues of floating point numbers) – BenBarnes Oct 25 '12 at 12:03

1 Answer


As per Matthew Dowle's comments below, data.table can group numeric values, treating two doubles as equal if they differ by less than .Machine$double.eps^0.5. With that in mind, a data.table solution should work:

library(data.table)

DT <- as.data.table(df)

DT[, D := 0L]  # integer 0 for every row, including those with Logical TRUE

# counter living in the calling environment; the assignment inside j below
# persists across groups because data.table reuses j's evaluation environment
.GRP <- 0L

DT[!Logical, D := .GRP <- .GRP + 1L, by = "A,B,C"]

#    Logical A    B    C foo D
# 1:    TRUE 1 1.00  1.0   1 0
# 2:   FALSE 2 0.05 10.8   2 1
# 3:   FALSE 3 0.80 15.0   3 2
# 4:   FALSE 2 0.05 10.8   4 1
# 5:   FALSE 3 0.80 15.0   5 2
# 6:   FALSE 1 1.00  1.0   6 3

As Matthew Dowle writes here, a built-in .GRP symbol is implemented in data.table 1.8.3, but I'm still on 1.8.2, hence the manual counter above.
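With 1.8.3 or later, the counter should be unnecessary; a sketch using the built-in .GRP, which numbers groups 1, 2, ... within a grouped assignment:

DT[, D := 0L]
DT[!Logical, D := .GRP, by = list(A, B, C)]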


Follow-up from the comments: here's the NEWS item from 1.8.2. Will add it to ?data.table, thanks for highlighting!

Numeric columns (type double) are now allowed in keys and ad hoc by. J() and SJ() no longer coerce double to integer. i join columns which mismatch on numeric type are coerced silently to match the type of x's join column. Two floating point values are considered equal (by grouping and binary search joins) if their difference is within sqrt(.Machine$double.eps), by default. See example in ?unique.data.table. Completes FRs #951, #1609 and #1075. This paves the way for other atomic types which use double (such as POSIXct and bit64). Thanks to Chris Neff for beta testing and finding problems with keys of two numeric columns (bug #2004), fixed and tests added.
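The ?unique.data.table example referred to there is along these lines (a sketch of that example from memory; the printed results assume 1.8.2's tolerance behavior, which later versions changed, see ?setNumericRounding):

DT2 <- data.table(a = tan(pi * (1/4 + 1:10)))  # ten values that all print as 1
length(unique(DT2$a))  # 10: the doubles differ at full machine precision
setkey(DT2, a)
nrow(unique(DT2))      # 1: keyed unique treats values within sqrt(.Machine$double.eps) as equal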

  • Yes that should work. I don't quite understand the first sentence about `factor` though. `data.table` internally has code to group `double` columns within machine tolerance, keeping `double` as `double`. It doesn't convert to `character` or `factor` and rely on formatting precision, like base does. See `example(unique.data.table)` for a `tan(pi(...))` example. The documentation could be clearer up in `?data.table` that grouping `double` columns is within machine tolerance. It uses the same tolerance as `all.equal` i.e. `.Machine$double.eps ^ 0.5`. – Matt Dowle Oct 25 '12 at 15:43
  • @MatthewDowle, thanks for the clarification. The `factor` stuff was a bit of a remnant from an earlier version of the answer. I'll clear it up after looking at `example(unique.data.table)`. – BenBarnes Oct 25 '12 at 15:45
  • Ok cool. I was surprised there's nothing about grouping tolerance for `double` in `?data.table`, so will put something in ... – Matt Dowle Oct 25 '12 at 15:52
  • @MatthewDowle, Thanks again. With the (great!) ability to use `double` columns as key columns, I'd find it very helpful to have info under `?data.table` mentioning the tolerance (or maybe it's there and I missed it...) – BenBarnes Oct 25 '12 at 15:55
  • Oh, it was in NEWS only from 1.8.2. I'll add that item as edit, and will add to `?data.table`... – Matt Dowle Oct 25 '12 at 15:58