
I'm new to R and cannot get my head around why a very basic script does not perform one-hot encoding in a Windows environment while it works perfectly well in a Linux environment. As I have to work in the failing Windows environment, I'd like to make the script perform one-hot encoding there.

This happens on Windows (one-hot encoding fails):

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.14.2 mltools_0.3.5

loaded via a namespace (and not attached):
[1] compiler_4.1.1  Matrix_1.3-4    tools_4.1.1     grid_4.1.1 lattice_0.20-44
>
> customers <- data.frame(
+     id=c(10, 20, 30, 40, 50),
+     gender=c('male', 'female', 'female', 'male', 'female'),
+     mood=c('happy', 'sad', 'happy', 'sad','happy'),
+     outcome=c(1, 1, 0, 0, 0))
>
> customers
  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0
>
> library(data.table)
> library(mltools)
>
> customers_1h <- one_hot(as.data.table(customers))
>
> customers_1h
   id gender  mood outcome
1: 10   male happy       1
2: 20 female   sad       1
3: 30 female happy       0
4: 40   male   sad       0
5: 50 female happy       0 

This is what I'd expect to happen (one-hot encoding works), as seen on Linux:

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Leap 15.3

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8       
 [4] LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0   
> 
> customers <- data.frame(
+     id=c(10, 20, 30, 40, 50),
+     gender=c('male', 'female', 'female', 'male', 'female'),
+     mood=c('happy', 'sad', 'happy', 'sad','happy'),
+     outcome=c(1, 1, 0, 0, 0))
> 
> customers
  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0
> 
> library(data.table)
data.table 1.14.2 using 8 threads (see ?getDTthreads).  Latest news: r-datatable.com
> library(mltools)
> 
> customers_1h <- one_hot(as.data.table(customers))
> 
> customers_1h
   id gender_female gender_male mood_happy mood_sad outcome
1: 10             0           1          1        0       1
2: 20             1           0          0        1       1
3: 30             1           0          1        0       0
4: 40             0           1          0        1       0
5: 50             1           0          1        0       0

At least the same packages seem to be installed. So why does one-hot encoding silently not take place, without even an error? Can anyone give me a hint on how to get Windows to behave?

Many thanks in advance

Chris

r2evans
Christian
  • Does this behave differently if you are not using `data.table`? It seems a bit of a red herring here, I'd think. – r2evans Oct 19 '21 at 20:16
  • Commenting out `library(data.table)` yields `Error in as.data.table(customers) : could not find function "as.data.table"` for `customers_1h <- one_hot(as.data.table(customers))`. At least that is a decent error on the Windows side, while Linux doesn't even notice. – Christian Oct 19 '21 at 20:24
  • Not using `library(data.table)` causes the missing `as.data.table()`. But removing this still causes the same behaviour. – Martin Gal Oct 19 '21 at 20:31
  • If the problem has nothing to do with the `data.table` package (and specifically your use of `as.data.table`), then I suggest you reduce your code. – r2evans Oct 19 '21 at 20:35
  • Thanks @MartinGal. – r2evans Oct 19 '21 at 20:35

1 Answer

I think this has to do with your R versions, not the platform. One of the key defaults for creating data.frames, stringsAsFactors, got a new default (FALSE) in R 4.0.0 after years of tripping up unsuspecting new users. However, some packages, such as, it seems, mltools, expect the kind of data frame that would be created using the old default, stringsAsFactors = TRUE. With the new default, gender and mood are character columns rather than factors, and one_hot apparently only encodes factor columns, so they pass through unchanged and without an error. For more: https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html
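To see the difference independently of the R version in use, here is a minimal base-R sketch that sets stringsAsFactors explicitly both ways and compares the resulting column classes. The names df_chr and df_fct are just illustrative.

```r
# Same data, built with both settings of stringsAsFactors.
# Setting the argument explicitly makes the result independent of the R version.
df_chr <- data.frame(
  gender = c("male", "female", "female"),
  mood   = c("happy", "sad", "happy"),
  stringsAsFactors = FALSE   # the default since R 4.0.0
)
df_fct <- data.frame(
  gender = c("male", "female", "female"),
  mood   = c("happy", "sad", "happy"),
  stringsAsFactors = TRUE    # the default before R 4.0.0
)

# Column classes: character in the first case, factor in the second
classes_chr <- sapply(df_chr, class)
classes_fct <- sapply(df_fct, class)
print(classes_chr)
print(classes_fct)
```

Running `sapply(customers, class)` on your own data frame on both machines should show whether this is what differs between them.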

I was able to replicate the problem and could fix it by setting stringsAsFactors = TRUE. (BTW, it looks like mltools::one_hot expects a data.table as input, so I'm not sure there's a way to avoid using data.table.)

Doesn't work:

customers <- data.frame(
    id=c(10, 20, 30, 40, 50),
    gender=c('male', 'female', 'female', 'male', 'female'),
    mood=c('happy', 'sad', 'happy', 'sad','happy'),
    outcome=c(1, 1, 0, 0, 0))

mltools::one_hot(data.table::as.data.table(customers))

   id gender  mood outcome
1: 10   male happy       1
2: 20 female   sad       1
3: 30 female happy       0
4: 40   male   sad       0
5: 50 female happy       0

Works:

customers <- data.frame(
    id=c(10, 20, 30, 40, 50),
    gender=c('male', 'female', 'female', 'male', 'female'),
    mood=c('happy', 'sad', 'happy', 'sad','happy'),
    outcome=c(1, 1, 0, 0, 0), stringsAsFactors = TRUE)

mltools::one_hot(data.table::as.data.table(customers))


   id gender_female gender_male mood_happy mood_sad outcome
1: 10             0           1          1        0       1
2: 20             1           0          0        1       1
3: 30             1           0          1        0       0
4: 40             0           1          0        1       0
5: 50             1           0          1        0       0
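If the data frame already exists (for example, it was read from a file) and you can't set stringsAsFactors when it is created, converting the character columns afterwards should have the same effect. This is only a sketch in base R; the helper variable is_chr is an illustrative name, and I'm assuming one_hot treats factors created this way the same as those created by data.frame().

```r
# Start from a data frame whose string columns are plain character vectors
customers <- data.frame(
  id      = c(10, 20, 30, 40, 50),
  gender  = c("male", "female", "female", "male", "female"),
  mood    = c("happy", "sad", "happy", "sad", "happy"),
  outcome = c(1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(customers, is.character, logical(1))
customers[is_chr] <- lapply(customers[is_chr], factor)

# gender and mood are now factors; id and outcome stay numeric
str(customers)
```

After this, `mltools::one_hot(data.table::as.data.table(customers))` should give the encoded table as in the "Works" case above.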
Jon Spring