
I have a text with data. The text is non-English with European-style numbers and tab-delimited values:

Ežeras  Plotas, ha  Gylis, m
Drūkšiai    4479,0  33,3
Dysnai  2439,4  6,0

I want to read it into R using functions from the readr package, but the resulting dataset has an encoding problem.

The code:

Sys.setlocale(locale = "Lithuanian")

library(readr)

read_tsv(locale = locale(decimal_mark = ","),
"Ežeras     Plotas, ha  Gylis, m
Drūkšiai    4479,0  33,3
Dysnai  2439,4  6,0
")

The result:

# A tibble: 2 x 3
  `E\xfeeras`      `Plotas, ha` `Gylis, m`
  <chr>                   <dbl>      <dbl>
1 "Dr\xfbk\xf0iai"        4479        33.3
2 Dysnai                  2439.        6  

I also tried encoding = "native" and encoding = "unknown" inside the function locale(), but these options are not recognized.

I could write the data to a text file and read that file, or use data.table::fread(), but those are not the solutions I am looking for.


devtools::session_info()

Session info --------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, mingw32             
 ui       RStudio (1.1.456)           
 language (EN)                        
 collate  Lithuanian_Lithuania.1257   
 tz       Europe/Helsinki             
 date     2018-10-15                  

Packages ------------------------------------------------------------------------
 package    * version    date       source                          
 assertthat   0.2.0      2017-04-11 CRAN (R 3.5.0)                  
 base       * 3.5.1      2018-07-02 local                           
 cli          1.0.1      2018-09-25 CRAN (R 3.5.1)                  
 compiler     3.5.1      2018-07-02 local                           
 crayon       1.3.4      2017-09-16 CRAN (R 3.5.0)                  
 datasets   * 3.5.1      2018-07-02 local                           
 devtools     1.13.6     2018-06-27 CRAN (R 3.5.1)                  
 digest       0.6.18     2018-10-10 CRAN (R 3.5.1)                  
 fansi        0.4.0      2018-10-05 CRAN (R 3.5.1)                  
 graphics   * 3.5.1      2018-07-02 local                           
 grDevices  * 3.5.1      2018-07-02 local                           
 hms          0.4.2.9001 2018-07-25 Github (tidyverse/hms@979286f)  
 memoise      1.1.0      2017-04-21 CRAN (R 3.5.0)                  
 methods    * 3.5.1      2018-07-02 local                           
 pillar       1.3.0      2018-07-14 CRAN (R 3.5.1)                  
 pkgconfig    2.0.2      2018-08-16 CRAN (R 3.5.1)                  
 R6           2.3.0      2018-10-04 CRAN (R 3.5.1)                  
 Rcpp         0.12.19    2018-10-01 CRAN (R 3.5.1)                  
 readr      * 1.1.1      2017-05-16 CRAN (R 3.5.1)                  
 rlang        0.2.2      2018-08-16 CRAN (R 3.5.1)                  
 rstudioapi   0.8        2018-10-02 CRAN (R 3.5.1)                  
 stats      * 3.5.1      2018-07-02 local                           
 tibble       1.4.2      2018-01-22 CRAN (R 3.5.0)                  
 tools        3.5.1      2018-07-02 local                           
 utf8         1.1.4      2018-05-24 CRAN (R 3.5.0)                  
 utils      * 3.5.1      2018-07-02 local                           
 withr        2.1.2      2018-09-05 Github (jimhester/withr@8b9cee2)
 yaml         2.2.0      2018-07-25 CRAN (R 3.5.1)   
GegznaV
  • I tried on a Mac box with all US locales == "en_US.UTF-8" and get a result with none of those hex codes. See `?Sys.getlocale`, `?Quotes` and `?Encoding`. Perhaps this is Windoze with code page 1252? – IRTFM Oct 14 '18 at 22:41
  • I tried `Sys.setlocale(locale = "lt_LT.UTF-8")` on MacOS, which worked fine as well. Please post the results of `sessionInfo()` ... – Ben Bolker Oct 14 '18 at 22:55
  • I added session information. And yes, I'm using Windows OS. – GegznaV Oct 15 '18 at 10:25
  • I believe this is related to bugs [15762](https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15762) and [16232](https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=16232). Also see [Ista Zahn's article on character encoding hell in R for Windows](https://dss.iq.harvard.edu/blog/escaping-character-encoding-hell-r-windows). – Montgomery Clift Oct 28 '18 at 08:22

2 Answers


Passing encoding = stringi::stri_enc_get() to locale() should work. See also: https://stackoverflow.com/a/46999569/5397672

read_tsv(locale = locale(decimal_mark = ",",
                         encoding = stringi::stri_enc_get()),
         "Ežeras     Plotas, ha  Gylis, m
Drūkšiai    4479,0  33,3
Dysnai  2439,4  6,0
")
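
To see why an explicit encoding helps, here is a small sketch that should run on any platform. It assumes the input bytes are Windows-1257 (the code page reported in the question) and simulates that by converting a UTF-8 string with iconv(). The names `txt`, `bytes` and `lakes` are made up for the illustration. readr accepts a raw vector as literal data, so no file is needed.

```r
library(readr)

# The question's data, written with \u escapes so this script stays ASCII
txt <- paste0(
  "E\u017eeras\tPlotas, ha\tGylis, m\n",
  "Dr\u016bk\u0161iai\t4479,0\t33,3\n",
  "Dysnai\t2439,4\t6,0\n"
)

# Assumption: simulate Windows-1257 input by converting to those bytes
bytes <- charToRaw(iconv(txt, from = "UTF-8", to = "windows-1257"))

# Tell readr how to decode the bytes and how to read the decimal comma
lakes <- read_tsv(
  bytes,
  locale = locale(decimal_mark = ",", encoding = "windows-1257")
)

names(lakes)
lakes
```

On the Windows machine in the question, stringi::stri_enc_get() would presumably supply that encoding name automatically, which is why it is not hard-coded in the call above.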
yutannihilation

Try fread() from the data.table package; it works with the Lithuanian locale for me. You can then convert the result with as_tibble() if you like. readr functions convert their output to UTF-8 by default, so another option is to call iconv() on the result of read_tsv(). That approach also works fine.
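
The iconv() idea can be sketched in base R alone. This is a minimal illustration, assuming the mislabelled bytes are Windows-1257 (the code page in the question's session info). The names `utf8_names`, `broken` and `fixed` are made up for the example, and the first two steps only simulate the problem so that the snippet runs on any platform.

```r
# Reference strings, written with \u escapes so the script itself stays ASCII
utf8_names <- c("E\u017eeras", "Dr\u016bk\u0161iai", "Dysnai")

# Simulate the problem: Windows-1257 bytes that end up labelled as UTF-8
broken <- iconv(utf8_names, from = "UTF-8", to = "windows-1257")
Encoding(broken) <- "UTF-8"  # wrong label; the bytes are still Windows-1257

# Repair: interpret the bytes as Windows-1257 and convert them to real UTF-8
fixed <- iconv(broken, from = "windows-1257", to = "UTF-8")

stopifnot(identical(fixed, utf8_names))
```

For a tibble returned by read_tsv(), the same conversion would be applied to names(x) and to each affected character column.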

  • Could you provide example code showing how you suggest using `read_tsv()` in combination with `iconv()` on the data from the question? – GegznaV Oct 15 '18 at 11:17
  • Simpler solution:
        xx <- "Ežeras Plotas, ha Gylis, m Drūkšiai 4479,0 33,3 Dysnai 2439,4 6,0 "
        x <- fread(xx) %>% as_tibble()
        #-------------- or ------------
        x <- read_tsv(locale = locale(decimal_mark = ","), xx)
        Encoding(c(names(x), x[[1]]))
        Encoding(names(x)) <- "ISO8859-13"
        Encoding(x[[1]]) <- "ISO8859-13"
        x
    – Kestutis Vinciunas Oct 16 '18 at 04:06