I'm trying to count the values in a data frame column that are at or above a certain threshold. The column holds decimal values ranging from 0 to 1. To do so, I use sapply to iterate over a vector of thresholds. When I supply an explicitly typed vector of thresholds, sapply works fine, but when I use seq() to define the thresholds I get odd results (with repeated counts) that do not match the first set. This only happens with decimals, not with whole numbers.

t <- data.frame(replicate(10, sample((0:10)/10, 1000, replace = TRUE)))

# thresholds typed out by hand
l <- sapply(c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9), function(x) {
    nrow(t[t[, "X1"] >= x, ])
})

# thresholds generated with seq()
l2 <- sapply(seq(0, 0.9, 0.1), function(x) {
    nrow(t[t[, "X1"] >= x, ])
})

print(l)
print(l2)

Output:

> print(l)
 [1] 1000  909  811  723  626  530  443  365  275  187
> print(l2)
 [1] 1000  909  811  626  626  530  365  275  275  187

When the same code is executed with integers and integer thresholds, l and l2 match perfectly.

Code for whole numbers:

t <- data.frame(replicate(10, sample(0:10, 1000, replace = TRUE)))

# thresholds typed out by hand
l <- sapply(c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9), function(x) {
    nrow(t[t[, "X1"] >= x, ])
})

# thresholds generated with seq()
l2 <- sapply(seq(0, 9, 1), function(x) {
    nrow(t[t[, "X1"] >= x, ])
})

print(l)
print(l2)

Output:

> print(l)
 [1] 1000  915  816  729  643  555  468  367  270  188
> print(l2)
 [1] 1000  915  816  729  643  555  468  367  270  188

I'm not sure if I'm missing something very basic or making a mistake.

Thank you.

Karthik

2 Answers

It's because seq() doesn't produce exactly the decimal values you are expecting:

> seq(0, 0.9, 0.1)[4] == 0.3
[1] FALSE

Using all.equal, which allows for the tiny deviations (floating point errors) from the exact decimals, recovers the "equality":

> all.equal(seq(0, 0.9, 0.1)[4], 0.3)
[1] TRUE
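To see where the two threshold vectors actually differ, something like the following sketch can help. The variable names thr_seq, thr_lit and mismatch are made up for illustration; the mismatching positions it reports line up with the positions where l2 departs from l in the question.

```r
# Thresholds as generated by seq() and as typed out by hand
thr_seq <- seq(0, 0.9, 0.1)
thr_lit <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)

# Print both versions of the fourth threshold with 17 decimal places
print(sprintf("%.17f", thr_seq[4]))
print(sprintf("%.17f", thr_lit[4]))

# Positions where the two vectors are not exactly equal
mismatch <- which(thr_seq != thr_lit)
print(mismatch)
# [1] 4 7 8

# With a tolerance, the two vectors are considered equal
print(isTRUE(all.equal(thr_seq, thr_lit)))
# [1] TRUE
```

At those positions the seq() value sits slightly above the intended decimal, so data values equal to the threshold fail the >= test and the count drops to that of the next threshold up.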

The integer version is not subject to the same floating point errors, which is why your two approaches behave consistently there.
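One way to take advantage of this, sketched below under the assumption that the data were built as (0:10)/10 as in the question, is to build the thresholds from integers in the same way. The names x, thr_int, thr_lit and count_at_or_above are hypothetical, and set.seed is only there to make the sketch reproducible.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility

# Data built the same way as column X1 in the question
x <- sample((0:10)/10, 1000, replace = TRUE)

# Thresholds typed by hand, and thresholds built by dividing integers
thr_lit <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
thr_int <- (0:9)/10

# Count the values at or above each threshold
count_at_or_above <- function(thresholds) {
  sapply(thresholds, function(th) sum(x >= th))
}

counts_lit <- count_at_or_above(thr_lit)
counts_int <- count_at_or_above(thr_int)

# Both threshold vectors are bit-for-bit the same, so the counts agree
print(identical(thr_int, thr_lit))
print(identical(counts_lit, counts_int))
```

Dividing an integer by 10 gives the same double as typing the decimal literal, whereas seq() here multiplies the step by the index, which can land on a neighbouring double.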

This is an instance of R FAQ 7.31 ("Why doesn't R think these numbers are equal?").

Gavin Simpson
  • I suspected it had something to do with floating point but didn't know where to look. Thank you for the links! – Karthik Mar 23 '18 at 05:39

You can resolve this with a comparison that treats values within a small tolerance as equal:

grt_or_near <- function (x, y, tol = .Machine$double.eps^0.5) 
{
  (x > y) | (abs(x - y) < tol)
}

t <- data.frame(replicate(10,sample((0:10)/10,1000,rep=TRUE)))
l <- sapply(c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9), function(x){
  nrow(t[grt_or_near(t[,"X1"],x),])
})


l2 <- sapply(seq(0, 0.9, 0.1), function(x){
  nrow(t[grt_or_near(t[,"X1"],x),])
})
l
# [1] 1000  924  830  759  664  570  480  374  290  186
l2
# [1] 1000  924  830  759  664  570  480  374  290  186
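If the data are known to be multiples of 0.1, as they are in the question, a possible alternative to a tolerance is to compare on an integer scale. This is only a sketch under that assumption; x, count_tenths and the other names are hypothetical, and the seed is fixed only for reproducibility.

```r
set.seed(42)  # assumption: fixed seed, only for reproducibility
x <- sample((0:10)/10, 1000, replace = TRUE)

# Convert both sides to whole tenths before comparing, so that tiny
# floating point deviations in the thresholds are rounded away
count_tenths <- function(thresholds) {
  sapply(thresholds, function(th) sum(round(x * 10) >= round(th * 10)))
}

counts_seq <- count_tenths(seq(0, 0.9, 0.1))
counts_lit <- count_tenths(c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9))

# The seq() thresholds now give the same counts as the typed ones
print(identical(counts_seq, counts_lit))
```

This relies on the data having a known resolution; for arbitrary decimals, a tolerance-based comparison such as grt_or_near is the more general choice.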
De Novo