I’d like to classify (bin) data in a certain way in R.
Suppose I have a numeric vector like the following:
> data = sample(1:500, 5000, replace = TRUE)
To bin this vector I define the following classes:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
   (0,10]   (10,20]   (20,30]   (30,40]   (40,50]
      102        80        87       113       117
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100]
      101        89        95       106       104
(100,200] (200,350] (350,480] (480,500]
     1002      1492      1318       194
If I want 0 to be included, I just have to add include.lowest = TRUE:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
   [0,10]   (10,20]   (20,30]   (30,40]   (40,50]
      102        80        87       113       117
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100]
      101        89        95       106       104
(100,200] (200,350] (350,480] (480,500]
     1002      1492      1318       194
In this example it makes no difference, because 0 doesn’t occur in the data at all (sample(1:500, ...) never returns 0). But if it did occur, e.g. 4 times, there would be 106 instead of 102 elements in class [0,10]:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
   [0,10]   (10,20]   (20,30]   (30,40]   (40,50]
      106        80        87       113       117
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100]
      101        89        95       106       104
(100,200] (200,350] (350,480] (480,500]
     1002      1492      1318       194
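To make the effect of include.lowest concrete, here is a minimal sketch that appends four zeros to the data. The vector data0, the name brks and the fixed seed are my own additions for illustration; they are not part of the data shown above, so the counts will differ from the tables.

```r
set.seed(1)  # assumption: fixed seed, only so the run is reproducible
data  <- sample(1:500, 5000, replace = TRUE)
data0 <- c(data, rep(0, 4))  # hypothetical: four extra zeros
brks  <- c(seq(0, 100, by = 10), 200, 350, 480, 500)

# Default: the first class is (0,10], so the zeros fall outside and become NA
cl.default <- cut(data0, breaks = brks)
sum(is.na(cl.default))

# include.lowest = TRUE closes the first class to [0,10], picking up the zeros
cl.lowest <- cut(data0, breaks = brks, include.lowest = TRUE)
sum(is.na(cl.lowest))
```

Note that table() silently drops NA by default, which is why the missing zeros are easy to overlook; table(cl.default, useNA = "ifany") would show them.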
There is another option for changing class limits. The default for cut() is right = TRUE, i.e. intervals are closed on the right. If you change it to right = FALSE you’ll get:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE, right = FALSE)
> table(data.cl)
data.cl
   [0,10)   [10,20)   [20,30)   [30,40)   [40,50)
       92        81        87       111       118
  [50,60)   [60,70)   [70,80)   [80,90)  [90,100)
      103        89        94       103       103
[100,200) [200,350) [350,480) [480,500]
     1003      1497      1320       199
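With right = FALSE, include.lowest effectively becomes “include.highest”: it now closes the last class, [480,500], instead of the first. The price is that every class limit shifts slightly (intervals are closed on the left and open on the right), so some classes contain different numbers of members.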
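The behaviour at the class limits can be checked with a few values that sit exactly on the breaks. This is only a sketch; the vector x and the names brks, cl.open and cl.closed are made up for illustration.

```r
brks <- c(seq(0, 100, by = 10), 200, 350, 480, 500)
x    <- c(0, 10, 480, 500)  # hypothetical values lying exactly on class limits

# right = FALSE alone: every class is [a,b), so 500 falls outside and becomes NA
cl.open <- cut(x, breaks = brks, right = FALSE)
print(cl.open)

# adding include.lowest = TRUE closes the LAST class, giving [480,500]
cl.closed <- cut(x, breaks = brks, right = FALSE, include.lowest = TRUE)
print(cl.closed)
```

In both calls 10 lands in [10,20) rather than in the first class, which is the shift in class limits mentioned above.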
But if I want the result to look like this, with the last class open on the right,
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
   (0,10]   (10,20]   (20,30]   (30,40]   (40,50]
      102        80        87       113       117
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100]
      101        89        95       106       104
(100,200] (200,350] (350,480] (480,500)
     1002      1492      1318       194
to exclude 500, too, what should I do?
Of course, one can say: “Just write data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 499)) instead of data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500)), because you’re dealing with integers.”
Well, that’s right, but what if that weren’t the case and I had floating-point values instead? How can I exclude 500 then?