I have a data frame that looks like this:
df <- data.frame(
  Logical = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  A = c(1, 2, 3, 2, 3, 1),
  B = c(1, 0.05, 0.80, 0.05, 0.80, 1),
  C = c(1, 10.80, 15, 10.80, 15, 1))
It prints as:
Logical A B C
1 TRUE 1 1.00 1.0
2 FALSE 2 0.05 10.8
3 FALSE 3 0.80 15.0
4 FALSE 2 0.05 10.8
5 FALSE 3 0.80 15.0
6 FALSE 1 1.00 1.0
I want to add a new variable, D, which is an integer assigned by the following rules: it is 0 if df$Logical is TRUE; otherwise it is an integer, starting at 1, that is the same for all rows whose values of A, B and C are approximately equal (approximately because they are doubles, so the comparison must allow a floating point margin of error).
The expected output is:
Logical A B C D
1 TRUE 1 1.00 1.0 0
2 FALSE 2 0.05 10.8 1
3 FALSE 3 0.80 15.0 2
4 FALSE 2 0.05 10.8 1
5 FALSE 3 0.80 15.0 2
6 FALSE 1 1.00 1.0 3
The first row gets 0 because Logical is TRUE. The second and fourth rows get 1 because A, B and C are approximately equal there; the same goes for the third and fifth rows, which get 2. Row six gets 3 because it is the next unique row. Note that the order of the integers assigned in D is irrelevant, except for the 0; e.g., rows 2 and 4 could also have been assigned 2, as long as that integer is unique among the other groups in D.
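(This is the usual floating point comparison issue; a quick illustration of why exact equality is not enough, using all.equal as in my loop further down:)
0.1 + 0.2 == 0.3                    # FALSE: representation error in doubles
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a tolerance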
I have considered using aggregating functions, for example dlply from plyr:
library("plyr")
df$foo <- 1:nrow(df)
foo <- dlply(df,.(A,B,C),'[[',"foo")
df$D <- 0
for (i in 1:length(foo)) df$D[foo[[i]]] <- i
df$D[df$Logical] <- 0
This works, but I am not sure how well it will cope with floating point error (I guess I could round the values before the call, which should make it quite stable).
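For illustration, a rounded variant might look like this (the four-decimal cutoff is my own guess, not part of the problem):
df2 <- df
df2[c("A", "B", "C")] <- round(df2[c("A", "B", "C")], 4)  # collapse near-equal doubles
df2$foo <- seq_len(nrow(df2))
foo <- dlply(df2, .(A, B, C), "[[", "foo")                # groups keyed on rounded values
With a loop it is quite easy: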
df$D <- 0
k <- 1  # next group id (renamed from c to avoid masking base::c)
for (i in 1:nrow(df)) {
  # start a new group only from rows that are not TRUE and not yet assigned
  if (!df$Logical[i] && df$D[i] == 0) {
    # flag every non-TRUE row whose (A, B, C) is approximately equal to row i's
    par <- sapply(1:nrow(df), function(j)
      !df$Logical[j] &&
        isTRUE(all.equal(unlist(df[j, c("A", "B", "C")]),
                         unlist(df[i, c("A", "B", "C")]))))
    df$D[par] <- k
    k <- k + 1
  }
}
but this is very slow for larger data frames.
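For reference, a vectorized sketch I have been toying with (the key construction and the rounding tolerance are my own assumptions, only checked on the toy data above):
key <- do.call(paste, c(round(df[c("A", "B", "C")], 4), sep = "\r"))  # one string key per row
df$D <- match(key, unique(key[!df$Logical]))  # integer id per unique key among non-TRUE rows
df$D[df$Logical] <- 0                         # rows where Logical is TRUE always get 0
This seems to reproduce the expected output, and match should scale far better than the double loop, but I do not know whether rounding is a sound substitute for all.equal here.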