I am seeing strange behaviour from data.table and the %in% operator. I am loading a data.table from a file whose header contains Russian letters in UTF-8.
d = fread(filename, sep="\t", encoding="UTF-8", verbose=TRUE)
bar=names(d)
bar
[1] "Дата, Время" "Состояние" "Ia, A" "Ib, A" "Ic, A"
[6] "Дисб.I" "акт.P,кВт" "P, кВА" "cos" "Загр., %"
[11] "Uвх.AB,В" "Uвх.BC,В" "Uвх.CA,В" "Дисб. U, %" "R, кОм"
[16] "F Турб.вращ.,Гц" "Приток,куб.м/cут" "Отбор,куб.м/cут" "P, ат." "Расход, куб.м/c"
[21] "Tдвиг, °C" "Tжид, °C" "Pвыкид, ат." "Tвыкид, °C" "Вибр X/Y, м/с2"
[26] "Вибр Z, м/с2" "Pвыс.р, ат." "Iутеч, мA" "Tобм, °C" "Акт.энерг,кВт"
[31] "Реакт.энерг,кВАр" "Вход1,ед." "Вход2,ед." "Вход3,ед." "Вход4,ед."
[36] "Вход5,ед." "Вход6,ед." "Вход7,ед." "Вход8,ед." "Статусн.сообщ."
I have one of these values hardcoded in my code:
foo="Uвх.AB,В"
And I am trying to do the following:
if (foo %in% bar) { ... }
To my surprise,
foo %in% bar
[1] FALSE
but
foo==bar
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[17] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[33] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Notice the TRUE in the 11th position. The reason is the encoding:
Encoding(foo)
[1] "UTF-8"
Encoding(bar)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[10] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[19] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[28] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[37] "unknown" "unknown" "unknown" "unknown"
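To check whether two such strings differ in their bytes or only in their declared encoding, something like the following could be used. It is a self-contained sketch, not the actual fread output: `foo` is rebuilt from Unicode escapes, and `bar11` is a hypothetical stand-in for `bar[11]` with the encoding mark removed.

```r
# Rebuild the hardcoded value from Unicode escapes, so the result does not
# depend on the encoding of the script file; the parser marks it as UTF-8.
foo <- "U\u0432\u0445.AB,\u0412"

# Hypothetical stand-in for bar[11]: the same bytes with the UTF-8 mark removed.
bar11 <- foo
Encoding(bar11) <- "unknown"

# Compare the raw bytes and the declared encodings separately.
same_bytes <- identical(charToRaw(foo), charToRaw(bar11))
marks <- c(foo = Encoding(foo), bar11 = Encoding(bar11))
print(same_bytes)
print(marks)
```

If the bytes are identical and only the marks differ, the mismatch is in how the strings are labelled rather than in their content.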
On data.table's side this is a bit strange, because I asked for encoding="UTF-8" in fread. On the other hand, the difference in behaviour between %in% (that is, match) and == is also very strange.
I sense the wrongness of the universe. Could somebody explain why %in% behaves so strangely with encodings, and what the correct way of using it is?
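One possible workaround, sketched under the assumption that the unmarked names really do contain UTF-8 bytes, is to declare them as UTF-8 before matching. The vector `bar` below is a small hypothetical stand-in for `names(d)`, built so the example runs without the original file.

```r
# Hardcoded value, written with Unicode escapes so it is marked as UTF-8.
foo <- "U\u0432\u0445.AB,\u0412"

# Hypothetical stand-in for names(d): same bytes, but without an encoding mark.
bar <- c("cos", foo)
Encoding(bar) <- "unknown"

# Declare the existing bytes to be UTF-8. This only sets the mark and converts
# nothing, so it is correct only if the bytes really are UTF-8 (an assumption).
bar_utf8 <- bar
Encoding(bar_utf8) <- "UTF-8"

# With both sides marked consistently, the lookup finds the value.
found <- foo %in% bar_utf8
print(Encoding(bar_utf8))
print(found)
```

Pure ASCII elements such as "cos" stay "unknown" after this, which is harmless. If the names were in the session's native encoding instead of UTF-8, enc2utf8(bar) would be the fitting call, since it converts rather than just re-labels.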