Distributing numeric variable into 3 categorical variables

Question

I am new here, so please let me know if I can make my question clearer.

I would like to predict absenteeism of employees, so I have to turn this numeric variable into a factor. The data are right-skewed, so I would like to distribute the entries equally over the categories. I would prefer a new variable "Group" that divides all observations equally into 1, 2 or 3.

The problem is that I cannot create this factor with equal n. I tried many of the suggestions from this topic: splitting a continuous variable into equal sized groups, such as cut and Hmisc's cut2. All options seem straightforward, but when I apply them, the categories are not equally sized.

I hope someone can help me; I am really curious why the above methods are not working for me. I would prefer an answer that uses base R. Below is a sample of my data:

structure(list(ID = c(11, 36, 3, 7, 11, 3), Reason_absence = c(26,
0, 23, 7, 23, 23), Age = c(33, 50, 38, 39, 33, 38), BMI2 = c(30,
31, 31, 24, 30, 31), Absenteeism_time = c(4, 0, 2, 4, 2, 2)), class =
c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

The full dataset consists of 700 rows and 21 columns.

Thanks in advance!

if those methods didn't work for you, you probably will have to give a [mcve] so we can figure out why not. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example and the `reprex` package for help in constructing a reproducible example ... (How many observations do you have? If you had [for example] 8 observations, it would be impossible to divide them into 3 equal groups ...) — Ben Bolker, Jan 06 '20 at 19:15
Dear Ben, thanks for your comment! I added reproducible example. — Henk, Jan 06 '20 at 20:36
so you want to factorize all numeric columns in equal groups ? — YOLO, Jan 06 '20 at 21:02
Hi YOLO, thanks for your comment! I only want to factorize the column Absenteeism_time. — Henk, Jan 07 '20 at 12:52

JacobJacox · Answer 1 · 2020-01-06T21:44:39.100

The easiest way would probably be to use the empirical cumulative distribution function (ECDF) and check which percentile each value falls at. For example:

a = c(1,2,3,5,6,13,30,45,100,110,120,125)
plot(ecdf(a))
b = ecdf(a)
b(2)
[1] 0.1666667
b(30)
[1] 0.5833333
b(120)
[1] 0.9166667

Now values with an ECDF value up to about 0.33 (1/3) will be in group A, values between 0.33 and 0.66 (2/3) will be in group B, and the rest in group C. We can easily write a function that does that:

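emp_cdf = ecdf(a)
f = function(new_value, n = 3, ecdf_ = emp_cdf) {
  # n is the number of groups; the function handles one value at a time
  groups = c(NA, LETTERS[1:n])
  # Group boundaries on the percentile scale: 1/n, 2/n, ..., 1.
  # The first boundary is -Inf so that it never matches and the lowest
  # values are assigned to the first group. (Shifting the whole sequence
  # down by a small offset would also shift the 1/n and 2/n boundaries.)
  thresholds = c(-Inf, seq(0, 1, length.out = n + 1)[-1])
  tmp_val = ecdf_(new_value)
  # Return the group of the first boundary the percentile does not exceed
  groups[which(tmp_val <= thresholds)[1]]
}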

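By construction, the values should now be distributed roughly equally across the groups, as long as there are few ties. Heavily tied data (for example many zeros) cannot be split into equal groups by any threshold rule, because identical values always land in the same group. You also get a prediction function to handle new values.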

f(0)
[1] "A"
f(200)
[1] "C"
f(26)
[1] "B"
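
To build the "Group" column the question asks for, the same idea can be applied to a whole column at once. The following is only a sketch in base R, not part of the approach above as originally written. It uses the six sample rows of Absenteeism_time from the question, cuts the ECDF percentiles at 1/3 and 2/3, and then shows a rank-based alternative (a different technique) that forces near-equal group sizes by breaking ties in order of appearance. The names df, Group and Group_equal are illustrative assumptions.

```r
# Sample column from the question (six rows only)
df <- data.frame(Absenteeism_time = c(4, 0, 2, 4, 2, 2))
x <- df$Absenteeism_time

# ECDF approach, vectorised: percentile of each value, cut into thirds.
# Tied values always share a group, so group sizes can be unequal.
p <- ecdf(x)(x)
df$Group <- cut(p, breaks = c(0, 1/3, 2/3, 1), labels = 1:3,
                include.lowest = TRUE)
table(df$Group)

# Rank-based alternative (assumption: equal n matters more than keeping
# ties together). Ties are broken by order of appearance, so group sizes
# differ by at most one, but equal values may end up in different groups.
r <- rank(x, ties.method = "first")
df$Group_equal <- factor(ceiling(r * 3 / length(x)), levels = 1:3)
table(df$Group_equal)
```

On these six rows the ECDF version cannot give equal groups, because three of the six values are 2 and tied values stay together. This is also the likely reason quantile-based helpers produce unequal categories on skewed, heavily tied data. The rank-based version gives equal counts at the cost of splitting ties arbitrarily.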
