Why don't these two return the same result?

    D = data.frame( x=c( 0.6 ) )

    D$binned = cut( D$x, seq( 0.50,0.70,0.025 ), include.lowest=TRUE, right=FALSE )
    D # 0.6 is binned correctly as [0.6,0.625)

    D$binned = cut( D$x, seq( 0.55,0.65,0.025 ), include.lowest=TRUE, right=FALSE )
    D # 0.6 is binned incorrectly as [0.575,0.6)
baixiwei

2 Answers

Representation error. A floating-point number is exact only if the value is a finite sum of powers of 2 that fits in the available precision. Every other number is rounded to the nearest representable value. Different ways of computing the same number can therefore pick up different rounding errors and land on different sides of the expected value (i.e. slightly above or slightly below it). In this case:

    > print(D$x,digits=22)
    [1] 0.5999999999999999777955
    > print(seq(0.5,0.7,0.025)[5],digits=22)
    [1] 0.5999999999999999777955
    > print(seq(0.55,0.65,0.025)[3],digits=22)
    [1] 0.6000000000000000888178
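
To tie this back to `cut`, here is a small check based on the values printed above (a sketch only; the names `x`, `a` and `b` are just for illustration). With `right=FALSE` the bin `[0.6,0.625)` only contains values that are greater than or equal to its lower break, so it matters whether the stored 0.6 sits at or just below that break:

```r
x <- 0.6
a <- seq(0.50, 0.70, 0.025)[5]  # the "0.6" break from the first call
b <- seq(0.55, 0.65, 0.025)[3]  # the "0.6" break from the second call

x == a                   # TRUE:  x sits exactly on the break, so it goes into [0.6,0.625)
x == b                   # FALSE: this break is stored slightly above x
x < b                    # TRUE:  so x falls into [0.575,0.6)
isTRUE(all.equal(x, b))  # TRUE:  equal within all.equal's default tolerance
```

The last line is the tolerance-based comparison people usually reach for, but `cut` compares against the breaks exactly.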
James
  • Unfortunately, not really. The errors are consistent, but ultimately the value depends on how it is calculated. The usual way of dealing with this is to only consider equality within a certain tolerance; however, `cut` needs sharp break points. – James Jul 12 '13 at 15:00
  • However, if your numbers will always have only a few decimal places, you could bump the break points accordingly (e.g., `seq( 0.55,0.65,0.025 ) - 0.000001`) and see if that helps. – Aaron left Stack Overflow Jul 12 '13 at 15:16
  • You might want to look at the source code for `hist.default` to see one approach. – hadley Jul 13 '13 at 07:13
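
The "bump the break points" idea from the comments could look roughly like the sketch below. It is loosely inspired by the fuzz that `hist.default` applies to its breaks, but it is not a copy of that code. The tolerance `eps` and the names `fuzzy_breaks` and `binned` are assumptions made for illustration, so pick a tolerance that suits the precision of your own data.

```r
x <- 0.6
breaks <- seq(0.55, 0.65, 0.025)

# Hypothetical tolerance: tiny compared with the bin width, but far larger
# than the floating-point rounding error in the breaks.
eps <- 1e-7 * median(diff(breaks))

# With right = FALSE the bins are [a, b), so nudge every break down a little;
# nudge the last one up so include.lowest still catches the maximum value.
n <- length(breaks)
fuzzy_breaks <- breaks + c(rep(-eps, n - 1), eps)

# The default labels use 3 significant digits (dig.lab = 3), so they still
# read as the intended break points.
binned <- cut(x, fuzzy_breaks, include.lowest = TRUE, right = FALSE)
binned  # [0.6,0.625)
```

The trade-off is that any value lying within `eps` below a break is now placed in the upper bin, which is only safe when the data cannot legitimately be that close to a break.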

    D$binned = cut( D$x, round(seq( 0.55,0.65,0.025 ),3), include.lowest=TRUE, right=FALSE )
    D

        x      binned
    1 0.6 [0.6,0.625)

Fabio Marroni