
I have a data frame containing character vectors. The data are the output of web scraping a particular website, gathered over the previous year on several different computers (PCs). The operating system was probably the same on all of them (Windows). After combining all the pieces into a single data frame, I found that applying gsub() (str_replace() causes the same issue) distorts the encoding, so that some Polish characters become wrongly encoded (the figure and the system specification of the PC are given below).

voivodeshipRaw is the raw (original) data, while voivodeshipProcessed is the character vector after applying gsub() (I tried to remove unusual whitespace, "\s+"). I applied Encoding() and stri_enc_detect() to detect the encoding; the results are shown in the columns encodingType and stri_enc_detect. As you can see, the two results differ. Cells with encoding distortions (column voivodeshipProcessed, ids 4, 7, 1946505 and 1946507) have "unknown" encoding according to Encoding() and windows-1250 according to stri_enc_detect(). I tried to change the encoding of such cells using the following functions:

stri_enc_toutf8 = stri_enc_toutf8(str = voivodeshipRaw)

encoding = sapply(voivodeshipRaw, function(x){
      Encoding(x) <- "UTF-8"
      return(x)
      })

iconv = iconv(voivodeshipRaw, from = "windows-1250", to = "utf-8")

The output is shown in the figure below. As you can see, the distortions still exist.
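One possible explanation for why none of these whole-column conversions helps is that the column mixes strings stored as UTF-8 with strings stored as Windows-1250, so no single source encoding fits every row. A minimal sketch under that assumption follows. `to_utf8_mixed()` is a hypothetical helper (not part of base R, stringi or the original post), the fallback encoding is a guess, and the sample vector is constructed purely for illustration.

```r
# Hypothetical helper: bring a character vector that may mix UTF-8 and
# Windows-1250 byte sequences to UTF-8. The fallback encoding is an assumption.
to_utf8_mixed <- function(x, fallback = "windows-1250") {
  ok <- is.na(x) | validUTF8(x)   # NA, or bytes that already form valid UTF-8
  out <- x

  # Valid UTF-8 bytes only need the encoding mark; the bytes are not changed.
  utf8_part <- x[ok]
  Encoding(utf8_part) <- "UTF-8"
  out[ok] <- utf8_part

  # Everything else is assumed to be in the fallback encoding and is re-encoded.
  out[!ok] <- iconv(x[!ok], from = fallback, to = "UTF-8")
  out
}

# Illustration: the same word stored in two encodings within one vector.
good  <- "\u015bl\u0105skie"                                # marked as UTF-8
bad   <- iconv(good, from = "UTF-8", to = "windows-1250")   # Windows-1250 bytes
mixed <- c(good, bad, NA, "abc")

clean     <- to_utf8_mixed(mixed)
processed <- gsub("\\s+", " ", clean)   # string functions applied after normalising
```

The design choice here is to decide per element rather than per column. One caveat: a Windows-1250 string whose bytes happen to form valid UTF-8 would be left untouched, so the result is worth spot-checking on real data.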

charToRaw() output for the voivodeshipRaw column shown in the figure below (raw data, śląskie voivodeship, ids 1946504 and 1946505), without and with the encoding issue:

# id: 1946504 -> proper encoding
c5 9b 6c c4 85 73 6b 69 65

# id: 1946505 -> wrong encoding
9c 6c b9 73 6b 69 65
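
These two byte sequences can be checked directly. The sketch below rebuilds both strings from the bytes listed above and tests the guess that the second one is Windows-1250 rather than UTF-8. That guess is an assumption based on the stri_enc_detect result, and the variable names are illustrative.

```r
# Bytes copied from the charToRaw() output above.
utf8_bytes  <- as.raw(c(0xc5, 0x9b, 0x6c, 0xc4, 0x85, 0x73, 0x6b, 0x69, 0x65))  # id 1946504
other_bytes <- as.raw(c(0x9c, 0x6c, 0xb9, 0x73, 0x6b, 0x69, 0x65))              # id 1946505

s_good <- rawToChar(utf8_bytes)
s_bad  <- rawToChar(other_bytes)

# Check which of the two byte sequences is valid UTF-8.
is_valid <- validUTF8(c(s_good, s_bad))

# Re-encode the second sequence, assuming it is Windows-1250.
s_fixed <- iconv(s_bad, from = "windows-1250", to = "UTF-8")

# TRUE if the assumption holds: both rows then carry identical bytes.
same_bytes <- identical(charToRaw(s_fixed), utf8_bytes)
```

If `same_bytes` is TRUE, the two rows hold the same word in two different encodings, which would suggest the column is a mix of UTF-8 and Windows-1250 strings rather than uniformly one or the other.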

My question is: how can I avoid such encoding distortions when applying gsub(), str_* (stringr) or stri_* (stringi) functions?

Figure: (image not included)

System specification:

sessionInfo()
# R version 4.1.1 (2021-08-10)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 19045)
#
# locale:
# [1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250    LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250
# system code page: 1251
jeparoff
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Please [do not post code or data in images](https://meta.stackoverflow.com/q/285551/2372064) – MrFlick Feb 17 '23 at 14:34
  • no guarantees but perhaps you could try adjusting the Locale so that LC_CTYPE = Polish_Poland.utf8 rather than Polish_Poland.1250 ? – Nir Graham Feb 17 '23 at 15:38
  • Switch to R 4.2. [R 4.2.0 on Windows came with a significant improvement. It uses UTF-8 as the native encoding…](https://blog.r-project.org/2022/06/16/upcoming-changes-in-r-4.2.1-on-windows/). No problems since I did it… – JosefZ Feb 17 '23 at 17:43
