
I’d like to sort data into classes in a certain way in R.
Assume I have a vector like the following:

> data = sample(1:500, 5000, replace = TRUE)

To classify this vector I’m defining these classes:

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
   (0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      102        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500] 
     1002      1492      1318       194 

If I want 0 to be included, I just have to add include.lowest = TRUE:

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
   [0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      102        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500] 
     1002      1492      1318       194 

In this example this makes no difference, because 0 doesn’t occur in the data at all. But if it did, e.g. 4 times, there would be 106 instead of 102 elements in class [0,10]:

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
   [0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      106        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500] 
     1002      1492      1318       194 
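
The effect of include.lowest can also be checked on a tiny vector that really contains 0. This is only a minimal sketch: `x`, `breaks`, `without` and `with_lowest` are made-up names for illustration and are not part of the data above.

```r
# Small made-up vector that contains the lowest break value, 0.
x <- c(0, 5, 10, 15, 20)
breaks <- c(0, 10, 20)

# Default: intervals are (0,10] and (10,20], so 0 belongs to no class (NA).
without <- cut(x, breaks = breaks)

# include.lowest = TRUE closes the first interval: [0,10] now contains 0.
with_lowest <- cut(x, breaks = breaks, include.lowest = TRUE)

table(without, useNA = "ifany")
table(with_lowest, useNA = "ifany")
```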

There is another option for changing the class limits. The default for cut() is right = TRUE. If you change it to right = FALSE you’ll get:

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE, right = FALSE)
> table(data.cl)
data.cl
   [0,10)   [10,20)   [20,30)   [30,40)   [40,50) 
       92        81        87       111       118 
  [50,60)   [60,70)   [70,80)   [80,90)  [90,100) 
      103        89        94       103       103 
[100,200) [200,350) [350,480) [480,500] 
     1003      1497      1320       199 

include.lowest now acts as “include.highest”, at the price of shifting the class limits, so some classes end up with a different number of members.
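
How right = FALSE and include.lowest interact can be seen on values that sit exactly on the breaks. Again a minimal sketch with made-up names (`x`, `breaks`, `left_closed`, `both`), not the data from above:

```r
# Values lying exactly on the break points.
x <- c(0, 10, 20)
breaks <- c(0, 10, 20)

# right = FALSE: intervals are [0,10) and [10,20), so 20 belongs to no class (NA).
left_closed <- cut(x, breaks = breaks, right = FALSE)

# Adding include.lowest = TRUE closes the LAST interval: [10,20] now contains 20.
both <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)

levels(left_closed)
levels(both)
```
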
But if I want the classification

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
   (0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      102        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500) 
     1002      1492      1318       194

to exclude 500, too, what should I do?
Of course, one can say: “Just write data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 499)) instead of data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500)), because you’re dealing with integer numbers.”
Well, that’s right, but what if that weren’t the case and I used floats instead? How could I exclude 500 then?

Ferdinand.kraft
Incognito
  • Warning: these are not "classes." "Class" has a very specific meaning in programming, and in `R` in particular. Anyway, you don't have to use `cut`; if all else fails, split your dataset using `plyr` tools or a series of `data.x <- data[data>480 & data<=500]` vs, say, `data.y <- data[data>480 & data<500]` – Carl Witthoft Sep 08 '13 at 18:51
  • It's easier if you exclude `500` and then use `cut`. You can do something like `data[data==500] <- Inf`. But beware of [this](http://stackoverflow.com/questions/2769510/numeric-comparison-difficulty-in-r). – Ferdinand.kraft Sep 08 '13 at 21:38
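
One possible reading of the suggestions above for non-integer data, as a rough sketch only: mask the values that reach the top break before calling cut(), then relabel the last level so it shows an open upper limit. The vector `x` and the names `x_open` and `x.cl` are made up for illustration, and this variant uses NA rather than Inf for the masked values. The comparison with the top break is subject to the floating-point caveat linked above.

```r
# Made-up non-integer data; 500 is included on purpose.
x <- c(0.5, 10, 250.25, 480, 499.999, 500)
breaks <- c(seq(0, 100, by = 10), 200, 350, 480, 500)

# Mask everything that reaches the top break, so it falls into no class.
x_open <- x
x_open[x_open >= max(breaks)] <- NA

# Default right = TRUE gives (0,10], ..., (480,500]; 500 itself is already NA.
x.cl <- cut(x_open, breaks = breaks)

# Relabel the last level so the label matches the now-open upper limit.
lev <- levels(x.cl)
lev[length(lev)] <- sub("\\]$", ")", lev[length(lev)])
levels(x.cl) <- lev

table(x.cl, useNA = "ifany")
```

The masked values show up as NA in the table, so they can be counted separately or dropped with useNA = "no".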
