I have a large dataset in CSV format that I want to use to build a prediction model. Because of its size, I planned to use the h2o package in R to build the model. However, several columns of the data contain Simplified Chinese characters, and h2o has difficulty reading them correctly.

I've tried two different approaches. The first was to read the file directly with the h2o.importFile() function. However, this ends up converting the Chinese characters into garbled codes.

The second was to first bring the data into R using readr's read_csv or base R's read.csv. After the data loaded correctly into R, I tried to convert the data.frame into an H2OFrame using the as.h2o function. However, this approach also produced garbled text.

To illustrate, I've written the following piece of code as an example:

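library(h2o)
# Toy data: a Chinese-text column and a numeric column
dat <- data.frame(x = rep(c("北京", "上海"), 50),
                  y = rnorm(n = 100, mean = 10, sd = 3))
h2o.init(nthreads = -1)  # start or connect to a local H2O cluster using all cores
h2o.dat <- as.h2o(dat)   # the Chinese characters come out garbled in this frame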
Felix Zhao

3 Answers

I would consider this a bug since R's data.frame can display the characters, but at the same time, the R H2OFrame cannot. I checked that this works for H2OFrames in Python, so it's an R issue only. I filed a bug here.

Update: This has been fixed (I have checked that it's working in H2O 3.32.0.1, but it was probably fixed a while ago).

Erin LeDell

I don't know if it is the best way, but I have worked with Korean data before and this is the process I generally follow. First, ensure that the data you need to read is encoded as "UTF-8". Second, ensure that the locale is set to English; you can check the current locale with:

Sys.getlocale(category="LC_ALL")

You can then read the file using the statement below:

dat <- read.csv("Test.txt", header = TRUE, encoding = "UTF-8", stringsAsFactors = FALSE)

dat[,1]
[1] "北京" "上海" "北京" "上海"

dat
        X.U.FEFF.X Y
1 <U+5317><U+4EAC> 1
2 <U+4E0A><U+6D77> 2
3 <U+5317><U+4EAC> 3
4 <U+4E0A><U+6D77> 4

As you can see, when you print the entire data.frame the characters are shown as Unicode escape codes (<U+XXXX>), but you can still see the Chinese characters by extracting an individual column, for example dat[,1].
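The first step above (making sure the text really is UTF-8) can also be checked from inside R once the data is loaded. The following is a minimal sketch using only base R, not part of the original answer; the toy data.frame and the variable names chr_cols, marks, and all_valid are illustrative assumptions.

```r
# Toy data mirroring the question; the \u escapes spell out 北京 and 上海.
dat <- data.frame(
  x = rep(c("\u5317\u4eac", "\u4e0a\u6d77"), 2),
  y = 1:4,
  stringsAsFactors = FALSE
)

# Convert every character column to UTF-8 (a no-op for strings already in UTF-8).
chr_cols <- vapply(dat, is.character, logical(1))
dat[chr_cols] <- lapply(dat[chr_cols], enc2utf8)

# Encoding() reports how R has marked each string; validUTF8() checks the bytes.
marks <- unique(Encoding(dat$x))
all_valid <- all(validUTF8(dat$x))
print(marks)
print(all_valid)
```

If validUTF8() returns FALSE for any element, the file was probably read with the wrong encoding, and converting the frame with as.h2o is unlikely to give readable text either.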

ab90hi
  • Hi @ab90hi, thanks for your advice. Actually, I had no problem reading the original dataset into R and displaying the Chinese characters correctly using read_csv from readr. The challenge is to import or convert the original dataset into an H2OFrame and have the characters display correctly. – Felix Zhao Jan 13 '17 at 12:40
Your problem is only that R does not display the encoded characters inside H2O frames; the data inside the H2O frame is still fully preserved, as in the original frame. If you open the H2O frame in the H2O Web/Flow UI, you will see that its data is exactly the same as in the original frame. The following image shows the results in various places, i.e. RStudio, the R view window, and the H2O Flow UI:

[Image: the frame as displayed in RStudio, the R view window, and the H2O Flow UI]

Please follow the link below for a solution; note that you must be able to update the locale on your machine to view those characters in the H2O data frames:

how to read data in utf-8 format in R?

AvkashChauhan