
I hope you can help me to find a solution, because my result is really unexpected…

I used the function expand.grid() to create a data frame from all combinations of the supplied vectors.

vector1=seq(from=0.8,to=1.6,by=0.2)
vector2=c(seq(from=0.8,to=1.8,by=0.2),2.6)
vector3=seq(from=0.6,to=1.2,by=0.2)

data=expand.grid(F1= vector1,F2= vector2,F3= vector3)
data
    F1  F2  F3
1   0.8 0.8 0.6
2   1.0 0.8 0.6
3   1.2 0.8 0.6
4   1.4 0.8 0.6
5   1.6 0.8 0.6
6   0.8 1.0 0.6
7   1.0 1.0 0.6 
…   …   …   …

Now I wanted to remove some rows with a logical comparison.

data_remove=which(data[,1]-data[,2]>0.2)
data_remove
[1] 3   4   5   8   …   110 113 114 115 120

Let’s take a look at row 113, because this entry is wrong (and perhaps some other entries in data_remove are too).

data
    F1  F2  F3
…   …   …   …
113 1.2 1.0 1.2
…   …   …   …

data[113,1]- data[113,2]
[1] 0.2

(data[113,1]- data[113,2])>0.2
[1] TRUE

This result is confusing to me because

0.2>0.2
[1] FALSE

and

mode(data[113,1])
[1] "numeric"
mode(data[113,2])
[1] "numeric"

Can you explain where my mistake is?

Many thanks in advance!

tueftla
    It's a variant of this classic question http://stackoverflow.com/q/9508518 . Due to numerical inaccuracies, the difference is not exactly `0.2`. Try to display `(data[113,1]- data[113,2]) - 0.2`. The result won't be exactly zero. – RHertel Jan 26 '17 at 08:51
  • Thanks for your comment. But how can I handle that problem? In the classic question you mentioned they are talking about `all.equal`… Okay I could round `data[113,1]-data[113,2]`. But is there a more elegant way? – tueftla Jan 26 '17 at 09:23

2 Answers


The general issue: floating point arithmetic

As mentioned by RHertel in his comment, this has to do with floating point arithmetic, and you can read more about it in the answers to this question. You will find everything you need there, and I won't discuss it further, since I have nothing meaningful to add.
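To see the issue in isolation (a classic illustration, independent of your data): decimal fractions such as 0.1, 0.2 and 0.3 have no exact binary representation, so even a seemingly trivial sum is only approximately right:

0.1 + 0.2 == 0.3
## [1] FALSE
print(0.1 + 0.2, digits = 17)
## [1] 0.30000000000000004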

The solution to your specific example

Your specific example can be solved by working with whole numbers and only converting to the numbers that you actually want at the end. This method also has limitations, which I will come back to at the end.

So, I basically start by defining the three vectors and the grid as follows:

vector1 <- seq(from = 8, to = 16, by = 2)
vector2 <- c(seq(from = 8, to = 18, by = 2), 26)
vector3 <- seq(from = 6, to = 12, by = 2)
data <- expand.grid(F1 = vector1, F2 = vector2, F3 = vector3)

In this way, I get numeric values that are 10 times larger than the ones you defined. This is easy to correct at the end by simply dividing by 10. The advantage is that for whole numbers, the comparison works as expected:

data_remove <- which(data[,1] - data[,2] > 2)
head(data[data_remove, ])
##    F1 F2 F3
## 3  12  8  6
## 4  14  8  6
## 5  16  8  6
## 9  14 10  6
## 10 16 10  6
## 15 16 12  6

You can see that the condition is satisfied in all cases. In particular, the row 113 that you mentioned in your question is not removed this time. To get to the data that you actually wanted, you simply need to divide by 10:

data_new <- data[-data_remove, ]/10
head(data_new)
##     F1  F2  F3
## 1  0.8 0.8 0.6
## 2  1.0 0.8 0.6
## 6  0.8 1.0 0.6
## 7  1.0 1.0 0.6
## 8  1.2 1.0 0.6

Limitations of this method

I promised to come back to the limitations of this method. From a mathematical point of view, it always works, as long as you use only rational numbers. For instance,

seq(1/3, 5, by = 1/4)

can be rewritten with whole numbers as

seq(4, 60, by = 3)/12
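As a quick sanity check (the individual floating point values may differ by a few ulps, but they agree within the default tolerance of all.equal):

all.equal(seq(1/3, 5, by = 1/4), seq(4, 60, by = 3)/12)
## [1] TRUE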

The factor 12 occurs because 12 is the least common multiple of 3 and 4. However, the following sequence cannot be rewritten with whole numbers, because it contains irrational numbers:

seq(sqrt(2), 7*sqrt(3), by = pi/5)

There is no factor q such that q * sqrt(2) and q * pi/5 are both whole numbers. But you could still work around the issue by rounding the numbers. Rounded to two digits after the decimal point, the sequence expressed with whole numbers is

seq(141, 1212, by = 63)/100

Another limitation may occur with very large numbers. If you have many significant digits and therefore need to multiply the sequence by very large numbers, the comparison will fail again:

(1e18 + 1) > 1e18
## [1] FALSE
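The reason is that R's numeric type is a double precision floating point number, which can represent whole numbers exactly only up to 2^53. Beyond that, neighbouring whole numbers can no longer be distinguished:

2^53 == 2^53 + 1
## [1] TRUE
2^52 == 2^52 + 1
## [1] FALSE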
Stibu
  • "The advantage is that for integers [...]" -- in R lingo, those are not integers. Also, if the "integers" are large enough, comparisons again fail (as I guess you know): `1e44+1L > 1e44 # FALSE` – Frank Jan 27 '17 at 21:03
    @Frank Thanks for your comment. I am aware that they are not integers in the sense of data type. I thought about writing "whole numbers" instead but was not sure whether the term would be understood. What do you think? I really did not think about very large integers, but you are right of course. I will add a comment where I discuss the limitations of the method. – Stibu Jan 28 '17 at 07:53
  • Is there any possibility to overcome the problem with very large numbers? – tueftla Jan 30 '17 at 10:14
  • If you have very large numbers in your problem, you could make them smaller by the opposite procedure: divide by some number and multiply again at the end. If this is not possible, it probably means that your vectors are very long, and you might run into memory issues when you try to use `expand.grid()`. If you absolutely need to work with large integers, you could use the `gmp` package, e.g., `gmp::as.bigz("123456789012345678901234567890")` – Stibu Jan 30 '17 at 17:31
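For completeness, a minimal sketch of the gmp approach mentioned in the last comment (assuming the gmp package is installed): bigz integers use exact arithmetic, so the comparison that fails for doubles works as expected.

library(gmp)
x <- as.bigz(10)^44   # an exact 45-digit integer, unlike the double 1e44
(x + 1) > x
## [1] TRUE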

In addition to Stibu's detailed answer (thanks a lot)…

My answer results from RHertel's hint and offers two solutions.

Let’s have a look at the vector data_remove and identify the entries that are wrong (8, 43, 78, 113).

data_remove
[1]   3   4   5   8   9  10  15  38  39  40  43  44  45  50  73  74  75  78  79  80  85 108 109 110 113 114 115 120
length(data_remove)
[1]  28

My first solution is to use the round function. Here you have to choose the digits argument.

data_remove1=which(round(data[,1]-data[,2],4)>0.2)
data_remove1
[1]   3   4   5   9  10  15  38  39  40  44  45  50  73  74  75  79  80  85 108 109 110 114 115 120
length(data_remove1)
[1] 24

When you increase the digits argument to 16 or higher, the four wrong entries appear in the vector again.

data_remove1=which(round(data[,1]-data[,2],16)>0.2)
data_remove1
[1]   3   4   5   8   9  10  15  38  39  40  43  44  45  50  73  74  75  78  79  80  85 108 109 110 113 114 115 120
length(data_remove1)
[1] 28

data_remove1=which(round(data[,1]-data[,2],22)>0.2)
data_remove1
[1]   3   4   5   8   9  10  15  38  39  40  43  44  45  50  73  74  75  78  79  80  85 108 109 110 113 114 115 120
length(data_remove1)
[1] 28

My second solution uses a vectorised version of the function all.equal. Here it is also possible to adjust the tolerance to your needs.

data_critical is a vector of the row indices where the difference between data[,1] and data[,2] is almost exactly 0.2.

elementwise.all.equal=Vectorize(function(x,y,z) {isTRUE(all.equal(x,y,z))})
data_critical=which(elementwise.all.equal(data[,1]-data[,2],rep(0.2,length.out=length(data[,1])),1e-15)==TRUE)
data_critical
[1]   2   8  14  20  37  43  49  55  72  78  84  90 107 113 119 125
data_remove_correct=match(data_critical,data_remove)
data_remove_correct
[1] NA  4 NA NA NA 11 NA NA NA 18 NA NA NA 25 NA NA
data_remove_correct=data_remove_correct[!is.na(data_remove_correct)]
data_remove_correct
[1]  4 11 18 25
data_remove_perfect=data_remove[-data_remove_correct]
data_remove_perfect
[1]   3   4   5   9  10  15  38  39  40  44  45  50  73  74  75  79  80  85 108 109 110 114 115 120
length(data_remove_perfect)
[1] 24

Why are not all entries of data_critical represented in data_remove? Look at the result of the subtraction: only the rows where the result is positive, i.e. where the rounding error pushes the difference slightly above 0.2, appear in the vector data_remove.

data[2,1]-data[2,2]-0.2
[1] -5.551115e-17
data[8,1]-data[8,2]-0.2
[1] 1.665335e-16
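A third option (just a sketch; the tolerance of 1e-9 is an arbitrary choice that only needs to be much larger than the ~1e-16 rounding error and much smaller than the 0.2 step of the grid) is to build the tolerance directly into the comparison:

tol=1e-9
data_remove2=which(data[,1]-data[,2]-0.2>tol)
length(data_remove2)
[1] 24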
tueftla