
I have a dataframe of values and for each value in the dataframe I want to determine if it is within say 10% of any other value in its row. I want to do this generically as I do not know how many columns I will have nor the names of the columns.

Some values are NA. If a value is the only non-NA value in its row (all the other values in the row are NA), I want to return TRUE for it. For the values which are themselves NA I want to return FALSE. The values are all non-negative (they can be 0).

For example, say I have the following dataframe:

dataDF <- data.frame(
                     a = c(100, 250,  NA, 700,   0),
                     b = c(105, 300, 280,  NA,   0),
                     c = c(200, 400, 280,  NA,   0)
                     )

In the first row we have a = 100, b = 105 and c = 200. a and b are within 10% of each other, so we would have TRUE for both of those. c is not within 10% of either a or b, so it would be FALSE.

In the second row no values are within 10% of each other, so all would be FALSE.

In the third row b and c are equal, so they are TRUE. a is NA, so it is FALSE.

In the fourth row we only have a value for a, so it is returned as TRUE. b and c are NA, so they are FALSE.

In the final row all values are the same, so we would have TRUE for all.

So my output would be

data.frame(
           a = c( TRUE, FALSE, FALSE,  TRUE, TRUE),
           b = c( TRUE, FALSE,  TRUE, FALSE, TRUE),
           c = c(FALSE, FALSE,  TRUE, FALSE, TRUE)
          )

How I calculate the percentage difference doesn't really matter, but the way I was going to do it is to divide the absolute difference by the average of the two values, so that I get the same result whichever of the two values I start from.

So for example to calculate the percentage difference between 100 and 105 it would be:

abs(100 - 105)/((100 + 105)/2) = 5/102.5 = 0.0488
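
The symmetric measure described above can be sketched as a small vectorised helper. This is only an illustration: the name pct_diff is hypothetical, and treating two zeros as a difference of 0 is an assumption made here so that 0/0 does not produce NaN (the question says rows of identical zeros should count as matching).

```r
# Hypothetical helper: symmetric percentage difference between x and y,
# i.e. the absolute difference divided by the mean of the two values.
# Assumes non-negative inputs, as stated in the question.
pct_diff <- function(x, y) {
  denom <- (x + y) / 2
  # Assumption: two zeros are treated as identical (difference 0)
  # rather than NaN from 0/0. NA inputs propagate as NA.
  ifelse(denom == 0, 0, abs(x - y) / denom)
}

pct_diff(100, 105)  # 5 / 102.5, about 0.0488
pct_diff(105, 100)  # same value, whichever way round
```

Because the denominator is the mean of both values, the result does not depend on the order of the arguments.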

Any ideas on the quickest and neatest way of doing this would be appreciated.

Thanks

user1165199

1 Answer


Define a function and apply it to each row of your data.frame:

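fun <- function(vec)
{
  n <- length(vec)

  # Row is entirely NA: nothing to compare, so every entry is FALSE
  if (all(is.na(vec)))
    return(rep(FALSE, n))

  noNA <- vec[!is.na(vec)]

  # A single non-NA value, or all non-NA values identical (including all 0):
  # TRUE for the non-NA entries, FALSE for the NAs
  if (length(unique(noNA)) == 1)
    return(!is.na(vec))

  res <- rep(FALSE, n)

  # TRUE if vec[i] is within 10% of any other value in the row;
  # na.rm = TRUE ignores comparisons that involve an NA
  for (i in seq_len(n))
    if (any(abs(vec[i] - vec[-i]) <= vec[-i] * 0.1, na.rm = TRUE))
      res[i] <- TRUE

  res
}

# Apply row-wise; apply() returns each row's result as a column, hence t()
output <- data.frame(t(apply(dataDF, 1, fun)))
names(output) <- names(dataDF)
output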

Gives the wanted result:

#      a     b     c
#1  TRUE  TRUE FALSE
#2 FALSE FALSE FALSE
#3 FALSE  TRUE  TRUE
#4  TRUE FALSE FALSE
#5  TRUE  TRUE  TRUE
Colonel Beauvel
  • Thanks Colonel, I have edited the above to put any(...) around the calculation so that it checks whether any of the other columns is within 10%, rather than just the first one. I also had to add na.rm = TRUE to deal with rows that have an NA and at least two different values in the other columns. It works fine like that, although it is rather slow as my dataframe is hundreds of thousands of rows long, so it has to loop through each of them. If there is a way of doing it without a loop that would be perfect; otherwise this will be fine. Thanks – user1165199 Dec 03 '14 at 13:20
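
On the request for a version without a loop over rows, one possible sketch is to loop over pairs of columns instead and compare whole columns at once. This is not the answerer's code: the name within_pct and the tol argument are hypothetical, and it assumes the same rule as the function above (difference at most 10% of the other value, non-negative data, a lone non-NA value in a row counts as TRUE).

```r
# Hypothetical vectorised variant: loops over column pairs, not rows.
within_pct <- function(df, tol = 0.1) {
  m <- as.matrix(df)
  n_col <- ncol(m)
  res <- matrix(FALSE, nrow(m), n_col, dimnames = dimnames(m))

  for (i in seq_len(n_col)) {
    for (j in seq_len(n_col)) {
      if (i == j) next
      # Whole-column comparison: is column i within tol of column j?
      close <- abs(m[, i] - m[, j]) <= tol * m[, j]
      # Comparisons involving NA count as FALSE
      res[, i] <- res[, i] | (!is.na(close) & close)
    }
  }

  # Rows with exactly one non-NA value: that value is TRUE, the NAs are FALSE
  lone <- rowSums(!is.na(m)) == 1
  res[lone, ] <- !is.na(m[lone, , drop = FALSE])

  as.data.frame(res)
}

dataDF <- data.frame(
  a = c(100, 250,  NA, 700,   0),
  b = c(105, 300, 280,  NA,   0),
  c = c(200, 400, 280,  NA,   0)
)
within_pct(dataDF)
```

The number of iterations depends only on the number of columns, so the cost per iteration is a handful of vector operations over all rows. Whether this is actually faster on a given dataframe would need to be measured.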