I have a data frame with categorical and numeric variables. I am trying to one-hot encode the categorical columns while leaving the numeric columns unchanged.

It took me some time to reproduce the issue I get with my own data, but apparently it comes from some of the numeric columns being loaded as integers.

So the following code should reproduce the issue I got:

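library(onehot)

set.seed(123)

# Column a is an integer vector containing NA; column b is a factor containing NA.
# stringsAsFactors = TRUE keeps b a factor on R >= 4.0, as it was by default before.
d_tmp <- data.frame(
  a = as.integer(sample(c(120, 94, 140, 100, 130, NA), 10, replace = TRUE)),
  b = sample(c("I", "II", NA), 10, replace = TRUE),
  stringsAsFactors = TRUE
)

# Fit the encoder and apply it to the same data
data.frame(predict(onehot(d_tmp), d_tmp))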

Outcome:

1          140   0    1
2  -2147483648   1    0
3          140   0    1
4           94   0    0
5           94   1    0
6  -2147483648   0    0
7          140   0    0
8          130   1    0
9          100   1    0
10 -2147483648   1    0

So the NA values are replaced by a large negative number (-2147483648) that seems arbitrary to me. While building the reproducible example, I found that this happens only if I wrap the column in as.integer() when creating the data frame (my original data is loaded as integer by default).
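
A quick check in base R suggests the value is not arbitrary. -2147483648 is exactly one below the smallest integer R can represent, and R uses that 32-bit pattern internally to mark NA_integer_. That onehot copies the raw value into its numeric output without checking for NA is only my assumption; I have not confirmed it from the package source. The variable names int_max and sentinel are just illustrative.

```r
# Largest representable integer in R (32-bit signed).
int_max <- .Machine$integer.max

# One below the smallest representable integer, computed as a double
# so that the subtraction does not overflow.
sentinel <- -as.numeric(int_max) - 1

# Compare with the value that shows up in place of NA.
print(sentinel == -2147483648)

# Within R itself, converting the integer NA to double keeps it missing.
print(is.na(as.numeric(NA_integer_)))
```

So the conversion inside R preserves missingness, which would be consistent with the replacement happening in compiled code that sees only the raw integer.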

Why is this happening, and how should I handle it robustly? Of course I could convert all integer columns to numeric (double), but I am not 100% sure that this fixes the problem once and for all. I want to know the reason so that I don't have to worry about silent data errors later. I hope this is the right place to ask; if you think this problem should be raised somewhere else, please let me know.
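
For reference, the conversion I have in mind would look roughly like the sketch below. It uses base R only, the example values are made up, and int_cols is just a helper name. I have not verified that this avoids the problem in every version of onehot.

```r
# Small example with an integer column containing NA (values are illustrative).
d_example <- data.frame(
  a = as.integer(c(140, NA, 94, 130)),
  b = c("I", "II", NA, "I"),
  stringsAsFactors = TRUE
)

# Flag the integer columns; is.integer() is FALSE for factors,
# so the categorical column is left alone.
int_cols <- vapply(d_example, is.integer, logical(1))

# Convert the flagged columns to double; NA stays NA in this step.
d_example[int_cols] <- lapply(d_example[int_cols], as.numeric)

# Sanity checks before handing the data to an encoder.
stopifnot(!any(vapply(d_example, is.integer, logical(1))))
stopifnot(is.na(d_example$a[2]))
```

The checks at the end only confirm that no integer columns remain and that the missing value survived the conversion; they say nothing about what the encoder does afterwards.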

Thanks for the help.

aldorado
  • If you are not tied to that package, you could try standard tools. Relevant: https://stackoverflow.com/questions/5616210/model-matrix-with-na-action-null ; https://stackoverflow.com/questions/4560459/all-levels-of-a-factor-in-a-model-matrix-in-r – user20650 Oct 01 '19 at 10:51
  • ... and it's probably worth raising a bug report with the package author, as it seems to be replacing the missing values with the minimum 32-bit integer (-2147483648) – user20650 Oct 01 '19 at 10:52
