I am using H2O and R for a binary classification problem. The dataset has over 800 features, and some of their names contain non-English characters, for example 'ö'.

I am getting the following error message:

Error in .verify_dataxy(params$training_frame, x, y): Invalid column names

Then the list of columns with the problematic characters.

I have already googled and searched SO for documentation on settings for accepted languages in H2O, without success.

Here is some sample code:

library(h2o)
h2o.init()
sodata <- data.frame(Erklärung = sample(c(0,1), 50, replace = TRUE),
                 isPot = sample(c(0,1), 50, replace = TRUE),
                 target = sample(c(0,1), 50, replace = TRUE))
#
tar <- "target"
pr <- setdiff(colnames(sodata), tar)
sohex <- as.h2o(sodata)
spl <- h2o.splitFrame(data = sohex, ratios = .7, seed = 1)
training <- spl[[1]]
testing <- spl[[2]]
#
gbm1 <- h2o.gbm(x = pr, 
                y = tar, 
                training_frame = training, 
                validation_frame = testing)
#
#h2o.shutdown()

The error message is

Error in .verify_dataxy(training_frame, x, y):
  Invalid column names: Erklärung

Is there a way to change the accepted language in H2O?

Edit: session and environment info,

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64_w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

Under the displayed settings after Sys.getenv() there is nothing language related.

maop

1 Answer

Edit based on your update: the ".1252" in your locale is the Windows-1252 code page, which is not Unicode. See https://en.wikipedia.org/wiki/Windows-1252

This answer shows some ways to change the locale for R. (You might also want to look into setting the default locale for mingw, if you don't want to set it in R each time.) I'll paste my sessionInfo output below, but I think any locale where each category ends in .UTF-8 will be fine, e.g. "de_DE.UTF-8".

BTW, one workaround is to strip out the special characters; see "Remove accents from a dataframe column in R" for a couple of ways you could do this. E.g.

sodata <- ...
...
colnames(sodata) <- iconv(colnames(sodata),to="ASCII//TRANSLIT")
sohex <- as.h2o(sodata)
...
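
If you go the renaming route, it may help to keep a lookup from the ASCII names back to the originals, so that output such as variable importances can be relabelled afterwards. Below is a minimal base-R sketch, assuming German column names. `to_ascii_names` and its replacement table are hypothetical helpers, not part of h2o, and an explicit mapping is used here in place of `iconv` because `//TRANSLIT` results can differ between platforms.

```r
# Hypothetical helper (not part of h2o): replace German special characters
# with ASCII equivalents and return a lookup back to the original names.
to_ascii_names <- function(nms) {
  # Written with \u escapes so the script's own file encoding does not matter.
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc", "\u00df")
  to   <- c("ae",     "oe",     "ue",     "Ae",     "Oe",     "Ue",     "ss")
  out <- enc2utf8(nms)
  for (i in seq_along(from)) {
    out <- gsub(from[i], to[i], out, fixed = TRUE)
  }
  # Guard against two different originals collapsing to the same ASCII name.
  out <- make.unique(out)
  # Named vector: ASCII name -> original name.
  setNames(nms, out)
}

# Toy data standing in for the real frame.
sodata <- data.frame(x = 1:3, isPot = 4:6, target = 7:9)
colnames(sodata)[1] <- "Erkl\u00e4rung"

lookup <- to_ascii_names(colnames(sodata))
colnames(sodata) <- names(lookup)   # rename before calling as.h2o(sodata)
```

After training, indexing `lookup` with a name from the model output gives back the original label.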

An unhelpful "works for me". I'm using h2o 3.22 (which is not that recent) with R 3.4.4, on Linux. You didn't say which line you got the error on, but after doing as.h2o() I can see "Erklärung" in the column headers, and the same when looking at training and testing. And when doing summary(gbm1) on the produced model, I see the umlaut in the variable importances:

   variable relative_importance scaled_importance percentage
1     isPot            0.676265          1.000000   0.708690
2 Erklärung            0.277981          0.411054   0.291310

My guess would be that you need to make sure your script is in UTF-8. And maybe check the locale you are running your R session in?
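
Both guesses can be checked from inside R with base functions only. This is a rough diagnostic sketch, and it does not by itself show what H2O does with the names. `nms` is a stand-in for your own `colnames(sodata)`.

```r
# Stand-in for colnames(sodata); \u00e4 is the a-umlaut written as an escape.
nms <- c("Erkl\u00e4rung", "isPot", "target")

# Is the session running in a UTF-8 locale?
# (Expected to be FALSE under e.g. German_Germany.1252.)
is_utf8_session <- isTRUE(l10n_info()[["UTF-8"]])

# Which names contain any non-ASCII character?
has_non_ascii <- vapply(enc2utf8(nms),
                        function(s) any(utf8ToInt(s) > 127L),
                        logical(1), USE.NAMES = FALSE)
non_ascii <- nms[has_non_ascii]

# Declared encoding of each name: "UTF-8", "latin1" or "unknown".
# "unknown" covers both plain ASCII and strings in the native encoding.
declared <- Encoding(nms)
```

If `non_ascii` lists names whose declared encoding is "latin1" or "unknown", converting them with `enc2utf8()` before `as.h2o()` might be worth trying.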

My sessionInfo() (running in RStudio; R from the commandline has identical locale settings):

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 19.1

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.4 tools_3.4.4 
Darren Cook
  • Thanks for the answer. Edited my post and added the locale settings. – maop Aug 21 '19 at 07:45
  • I got the error right after running the `h2o.gbm`. (Or `h2o.grid`). The encoding is UTF-8. – maop Aug 21 '19 at 07:56
  • Unfortunately, I am in a highly regulated environment and cannot change .Rprofile or system settings, as I lack administrative rights. Sys.setlocale() does not allow me to make changes. Sys.setenv(LANG) works but does not change anything in the locale settings. What I'd like is an h2o package-specific setting that allows umlauts in the column names, if there is one. (In a normal session, `data.table` and `data.frame` work with these column names.) – maop Aug 21 '19 at 09:33
  • @maop Maybe you could try starting h2o from the commandline, instead of using `h2o.init()`. Or, if you call `h2o.init()` after your `Sys.setenv()` call, does it work? But I'd also try to persuade your system administrators to embrace UTF8/Unicode and move on from 20th Century encodings :-) – Darren Cook Aug 23 '19 at 08:05
  • Persuading sysadmins is good advice :) I tried initializing h2o after making sure I had called `Sys.setenv()`. I kind of gave up and changed the column names using `setnames()`, assigning new names like this: `paste0("v", seq(1, length(predictors), 1))`. – maop Aug 23 '19 at 08:10
  • @maop A less destructive way to change the names is using `iconv`; see my update. – Darren Cook Aug 24 '19 at 07:09