I have a data frame with categorical and numeric variables. I am trying to one-hot encode the categorical variables while leaving the numeric ones unchanged.
It took me some time to reproduce the issue that I get with my own data, but apparently it is caused by some of the numeric columns being loaded as integers.
The following code should reproduce the issue:
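library(onehot)
set.seed(123)
# a: integer column containing NAs; b: categorical column containing NAs
d_tmp <- data.frame(
  a = as.integer(sample(c(120, 94, 140, 100, 130, NA), 10, replace = TRUE)),
  b = sample(c("I", "II", NA), 10, replace = TRUE)
)
# Encode the data frame and convert the resulting matrix back to a data frame
data.frame(predict(onehot(d_tmp), d_tmp))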
Outcome:
1 140 0 1
2 -2147483648 1 0
3 140 0 1
4 94 0 0
5 94 1 0
6 -2147483648 0 0
7 140 0 0
8 130 1 0
9 100 1 0
10 -2147483648 1 0
So the NA values in the numeric column are replaced by a large negative number (-2147483648), which seems arbitrary to me. While trying to reproduce the data frame, I found that this happens only if I add as.integer() when creating the data frame (my own original data is loaded as integer by default).
Why is this happening, and how should I handle it robustly? Of course I could convert all integer columns to numeric, but I am not 100% sure that this fixes the problem once and for all. I want to know the reason, so that I don't have to worry about implicit data errors later. I hope this is the right place for this question; if you think it should be asked somewhere else, please let me know.
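For reference, the conversion mentioned above could look like the sketch below. It uses base R only and a small hypothetical data frame (not my real data). One observation that may be relevant: -2147483648 is -2^31, which is the value R reserves internally to represent NA_integer_, so it looks as if the integer NAs are copied into the numeric output without being recognised as missing. That is only my assumption about the cause. I have also not confirmed that this workaround is sufficient in every case.

```r
# Hypothetical example data: an integer column with NAs and a factor column.
d_tmp <- data.frame(
  a = as.integer(c(120, NA, 140, 94, NA)),
  b = c("I", "II", NA, "I", "II"),
  stringsAsFactors = TRUE
)

# R integers range from -2147483647 to 2147483647; the value one below that
# range is the one that shows up in place of NA in the output above.
int_min <- -.Machine$integer.max - 1

# Convert every integer column to double. Factors are not affected because
# is.integer() returns FALSE for them. NA_integer_ becomes NA_real_.
int_cols <- vapply(d_tmp, is.integer, logical(1))
d_tmp[int_cols] <- lapply(d_tmp[int_cols], as.numeric)

str(d_tmp)
```

After this step, column a is of type double and its missing values are still NA, so the data frame can be passed to onehot() as before.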
Thanks for the help.