I don't understand the following behavior of cut.

> data = seq(0,1,.2)
> data
[1] 0.0 0.2 0.4 0.6 0.8 1.0
> cuts = cut(data, c(0, 0.25, 0.5, .6, 0.9, Inf))
> summary(cuts)
  (0,0.25] (0.25,0.5]  (0.5,0.6]  (0.6,0.9]  (0.9,Inf]       NA's 
         1          1          0          2          1          1

As I understand it, the intervals produced by cut are closed on the right. So the interval (0.5,0.6] should contain one element (0.6) rather than zero, and the interval (0.6,0.9] should contain one element rather than two.

Where am I going wrong?

artemis

1 Answer

It has to do with a slight error in the numbers that are generated by seq:

> data[4] - 0.6
[1] 1.110223e-16

From that, you can see that data[4] is ever so slightly larger than 0.6, hence it goes up to the next bucket.
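
To make the difference visible directly, here is a minimal sketch that prints the stored values with 17 significant digits and compares them. The variable names (stored, literal, exact_equal, above_break, near_equal) are illustrative only, and the printed digits assume standard IEEE 754 double precision.

```r
# Rebuild the vector from the question.
data <- seq(0, 1, 0.2)

# Print enough significant digits to expose the stored doubles.
stored  <- sprintf("%.17g", data[4])  # "0.60000000000000009"
literal <- sprintf("%.17g", 0.6)      # "0.59999999999999998"

# Exact comparison fails, while a tolerance-based comparison succeeds.
exact_equal <- data[4] == 0.6                   # FALSE
above_break <- data[4] > 0.6                    # TRUE, so cut() picks (0.6,0.9]
near_equal  <- isTRUE(all.equal(data[4], 0.6))  # TRUE
```

Because cut compares each value against the breaks exactly, the tiny excess is enough to move data[4] past the 0.6 break.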

The reason is that not all numbers can be represented exactly in an encoding scheme with finite storage. The best you can hope for is a close approximation. In this case, an error of about 10^-16 for a value of order 10^-1 is minuscule, but non-zero.
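
If the goal is to get the bins the question expected, two possible workarounds are sketched below. Neither comes from the original answer: rounding to 10 decimal places is an assumed tolerance that happens to suit this data, and the names breaks, cuts_rounded, data_div and cuts_div are hypothetical.

```r
data   <- seq(0, 1, 0.2)
breaks <- c(0, 0.25, 0.5, 0.6, 0.9, Inf)

# Option 1: round away the representation error before binning.
# Assumption: 10 decimal places is finer than any real difference in the data.
cuts_rounded <- cut(round(data, 10), breaks)

# Option 2: build the grid by dividing integers, so that 3/5 is the
# same double as the literal 0.6 used in the breaks.
data_div <- (0:5) / 5
cuts_div <- cut(data_div, breaks)

# Each of the five intervals now holds exactly one value.
table(cuts_rounded)
table(cuts_div)
```

In both cases the value 0 still falls outside (0,0.25] and is counted as NA, as in the original output.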

paxdiablo
  • If you are going for the low hanging fruit with your answers, you should make them at least high quality. That would mean explaining floating point number precision. You can save the effort here because there is an excellent answer to the duplicate already. I suggest that you delete this answer. – Roland Apr 02 '15 at 07:13
  • @Roland, I have updated with more explanation. I would _hope_ you would remove the negative vote since that's _supposed_ to be used for unhelpful answers. If you disagree with the _question,_ you should close (as you have) and possibly delete the question. I don't delete answers based on the say so of one member, those that are net negative after some time are indeed culled, since the SO "swarm" is better suited to judge than any one member. – paxdiablo Apr 02 '15 at 07:37
  • I note, for example, http://stackoverflow.com/questions/11985799/converting-date-to-a-day-of-week-in-r/11985801#11985801, where you have an answer to a question that's been closed as dupe. So I imagine you'll be deleting that, yes? :-) – paxdiablo Apr 02 '15 at 07:40
  • I've removed my downvote. However, note that the example you found in my early contributions was posted hours before the question was closed as duplicate (on a different SE site) whereas your answer (which did not really answer the question) was posted after this question was closed as duplicate. I have actually thought about deleting my old answer because I find it somewhat embarrassing, but apparently that duplicate seems to be popular and I don't want to delete the base R solution and leave only the package solutions there. – Roland Apr 02 '15 at 07:49
  • Sorry, @Roland, couldn't resist having a jab :-) I suspect this question will end up being deleted anyway, or possibly merged, though it may be a bit iffy if the questions themselves aren't near _exact_ duplicates but only have a lot of crossover in the answers. Anyway, we'll see. – paxdiablo Apr 02 '15 at 08:16
  • @Roland As a lower-ability user, I find this question-answer extremely useful. I would never have found the linked question-answer when searching for "cut" "seq" because I would not have realized that it was an issue with the number storage itself. It is not a duplicate from my perspective. – user3386170 Jan 22 '18 at 19:28