I am seeing strange behaviour from data.table and the %in% operator. I am loading a file whose UTF-8 header contains Russian letters into a data.table.

d = fread(filename, sep="\t", encoding="UTF-8", verbose=TRUE)
bar=names(d)
bar

 [1] "Дата, Время"      "Состояние"        "Ia, A"            "Ib, A"            "Ic, A"           
 [6] "Дисб.I"           "акт.P,кВт"        "P, кВА"           "cos"              "Загр., %"        
[11] "Uвх.AB,В"         "Uвх.BC,В"         "Uвх.CA,В"         "Дисб. U, %"       "R, кОм"          
[16] "F Турб.вращ.,Гц"  "Приток,куб.м/cут" "Отбор,куб.м/cут"  "P, ат."           "Расход, куб.м/c" 
[21] "Tдвиг, °C"        "Tжид, °C"         "Pвыкид, ат."      "Tвыкид, °C"       "Вибр X/Y, м/с2"  
[26] "Вибр Z, м/с2"     "Pвыс.р, ат."      "Iутеч, мA"        "Tобм, °C"         "Акт.энерг,кВт"   
[31] "Реакт.энерг,кВАр" "Вход1,ед."        "Вход2,ед."        "Вход3,ед."        "Вход4,ед."       
[36] "Вход5,ед."        "Вход6,ед."        "Вход7,ед."        "Вход8,ед."        "Статусн.сообщ."

One of these values is hardcoded in my code:

foo="Uвх.AB,В"

and I am trying to do the following:

if (foo %in% bar) { ... } 

To my surprise,

foo %in% bar

[1] FALSE

but

foo==bar

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[17] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[33] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Notice the TRUE in the 11th position. The reason lies in the encoding:

Encoding(foo)

[1] "UTF-8"

Encoding(bar)

 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[10] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[19] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[28] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[37] "unknown" "unknown" "unknown" "unknown"
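
To check that the mismatch is only in the encoding mark and not in the bytes themselves, something like the following sketch can be used. It does not read my file: `unmarked` is a hypothetical stand-in for `bar[11]`, built by stripping the mark from the same string, so it only illustrates the situation under that assumption.

```r
# The hardcoded value, written with \u escapes so that R marks it as UTF-8
foo <- "U\u0432\u0445.AB,\u0412"

# Hypothetical stand-in for bar[11]: same bytes, encoding mark removed
unmarked <- foo
Encoding(unmarked) <- "unknown"

# Compare the declared encodings and the underlying bytes
marks <- Encoding(c(foo, unmarked))
same_bytes <- identical(charToRaw(foo), charToRaw(unmarked))

print(marks)
print(same_bytes)
```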

On data.table's side this is a bit strange, because I asked for encoding="UTF-8" in fread. On the other hand, the fact that %in% (aka match) behaves differently from == is also very strange.

I sense the wrongness of the universe. Could somebody explain why %in% behaves so strangely with encodings, and what the correct way of using it is?
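
One workaround I am considering, as a sketch only: if the bytes in the names really are UTF-8 and merely lack the mark, declaring the encoding with base R's `Encoding<-` should let `%in%` find the value. `bar_raw` below is a hypothetical stand-in for `names(d)`, and I have not confirmed that this is the recommended approach.

```r
# The hardcoded value, marked as UTF-8 because of the \u escapes
foo <- "U\u0432\u0445.AB,\u0412"

# Hypothetical stand-in for names(d): UTF-8 bytes with the mark stripped
bar_raw <- c(foo, "cos")
Encoding(bar_raw) <- "unknown"

# Declare the bytes to be UTF-8; this relabels the strings, it does not convert them
bar_fixed <- bar_raw
Encoding(bar_fixed) <- "UTF-8"

found <- foo %in% bar_fixed
print(found)
```

Because this only relabels the strings, it would give wrong results if the file were not actually UTF-8. Pure ASCII names such as "cos" stay marked as "unknown", which should be harmless.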

Anatoliy Orlov
  • 2
    It would be better to include your sample data in a more [reproducible format](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Ideally something that doesn't rely on using `fread` from a file that we don't have access to. That will make it easier to help you. – MrFlick Nov 28 '16 at 21:17
  • Do I understand correctly that you have an issue with `fread()` but not with `data.table` in general? – Uwe Nov 28 '16 at 21:35
  • 1
    I see something similar with the name constructor in `data.frame()`. Bad result: `"Uвх.AB,В" == names(data.frame("Uвх.AB,В" = 1))` vs good result: `"Uвх.AB,В" == names(setNames(data.frame(1),"Uвх.AB,В"))`. It's probably because `fread` uses `make.names` unless you set `check.names = FALSE`. – Frank Nov 28 '16 at 22:10
  • give `read.csv2` a shot, colleague had encoding issues that were resolved with it, maybe it works for you – John Smith Nov 29 '16 at 19:24
  • I had a similar problem a while back: http://stackoverflow.com/questions/39633211/data-table-logical-comparison-and-encoding-bugs-errors-in-non-english-enviromen/40906765#40906765 Check whether your problem is fixed in the new data.table version 1.9.8 – ErrantBard Dec 01 '16 at 10:00

0 Answers