I have a data frame that contains character vectors. The data are the output of scraping a particular website, collected over the previous year on several different computers (PCs). The operating system was probably the same on all of them (Windows). After combining all the pieces into a single data frame, I found that applying gsub() (str_replace() causes the same issue) distorts the encoding, so that some Polish characters become wrongly encoded (a figure and the system specification of the PC are presented below). voivodeshipRaw is the raw (original) data, while voivodeshipProcessed is the character vector after applying gsub() (I tried to remove unusual whitespace with the pattern "\s+"). I used Encoding() and stri_enc_detect() to detect the encoding; their output is shown in the columns encodingType and stri_enc_detect. As you can see, the results differ. The cells with encoding distortions (column voivodeshipProcessed, id: 4, 7, 1946505 and 1946507) have unknown encoding according to Encoding() and windows-1250 according to stri_enc_detect(). I tried to change the encoding of such cells using the following functions:
stri_enc_toutf8 = stri_enc_toutf8(str = voivodeshipRaw)
encoding = sapply(voivodeshipRaw, function(x){
Encoding(x) <- "UTF-8"
return(x)
})
iconv = iconv(voivodeshipRaw, from = "windows-1250", to = "utf-8")
The output is presented in the figure below. As you can see, the distortions still exist.
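All three attempts above treat the whole column as if it had a single encoding, while the data appear to mix UTF-8 rows with windows-1250 rows. Below is a minimal sketch of a per-row conversion, under the assumption that every element that is not valid UTF-8 is encoded in windows-1250. The helper name to_utf8_per_row and the variable voivodeshipUtf8 are hypothetical, and the two sample strings are rebuilt from the charToRaw() bytes quoted in this question so that the example does not depend on the encoding of the script file.

```r
# Hypothetical helper (not from any package): convert only the elements that
# are not valid UTF-8, assuming those elements are encoded in `fallback`.
to_utf8_per_row <- function(x, fallback = "windows-1250") {
  needs_fix <- !is.na(x) & !validUTF8(x)
  x[needs_fix] <- iconv(x[needs_fix], from = fallback, to = "UTF-8")
  Encoding(x) <- "UTF-8"  # declare the encoding; pure ASCII elements stay "unknown"
  x
}

# Stand-in for voivodeshipRaw, built from raw bytes.
voivodeshipRaw <- c(
  rawToChar(as.raw(c(0xc5, 0x9b, 0x6c, 0xc4, 0x85, 0x73, 0x6b, 0x69, 0x65))),  # UTF-8 row
  rawToChar(as.raw(c(0x9c, 0x6c, 0xb9, 0x73, 0x6b, 0x69, 0x65)))               # windows-1250 row
)

voivodeshipUtf8 <- to_utf8_per_row(voivodeshipRaw)

# Both rows should now be valid UTF-8 and hold the same bytes.
validUTF8(voivodeshipUtf8)
identical(charToRaw(voivodeshipUtf8[1]), charToRaw(voivodeshipUtf8[2]))
```

The validity test is only a heuristic: a windows-1250 string could in principle also be valid UTF-8, although that seems unlikely for Polish text.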
The charToRaw() output for the voivodeshipRaw column in the figure below (raw data, śląskie voivodeship, id: 1946504 and 1946505), without and with the encoding issue:
# id: 1946504 -> proper encoding
c5 9b 6c c4 85 73 6b 69 65
# id: 1946505 -> wrong encoding
9c 6c b9 73 6b 69 65
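For reproducibility, the two rows can be rebuilt directly from these bytes. The sketch below checks which row is valid UTF-8 and shows that the bytes of the second row, read as windows-1250, convert to exactly the bytes of the first row. The variable names are illustrative only, and the use of windows-1250 as the source encoding is an assumption based on the stri_enc_detect() result.

```r
# Bytes reported by charToRaw() for the two rows in this question.
utf8_bytes   <- as.raw(c(0xc5, 0x9b, 0x6c, 0xc4, 0x85, 0x73, 0x6b, 0x69, 0x65))  # id 1946504
cp1250_bytes <- as.raw(c(0x9c, 0x6c, 0xb9, 0x73, 0x6b, 0x69, 0x65))              # id 1946505

rows <- c(rawToChar(utf8_bytes), rawToChar(cp1250_bytes))

# Only the first row is a valid UTF-8 byte sequence.
validUTF8(rows)  # TRUE FALSE

# Interpret the second row as windows-1250 (assumption) and convert to UTF-8.
fixed <- iconv(rows[2], from = "windows-1250", to = "UTF-8")

# After conversion the bytes match the correctly encoded row.
identical(charToRaw(fixed), utf8_bytes)
```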
My question is: how can I avoid such encoding distortions after applying gsub(), str_* or stri_* functions?
System specification:
sessionInfo()
# R version 4.1.1 (2021-08-10)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 19045)
# locale:
# [1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C LC_TIME=Polish_Poland.1250
# system code page: 1251