
I wrote a function for wrangling strings. It converts non-English characters to their closest English equivalents and performs a few other cleanup operations.

trim <- function (x) gsub("^\\s+|\\s+$", "", x)

library(qdapRegex)

wrangle_string <- function(s) {
  # 1 character substitutions
  old1 <- "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüýşğçıöüŞĞÇİÖÜ"
  new1 <- "szyaaaaaaceeeeiiiidnooooouuuuysgciouSGCIOU"
  s1 <- chartr(old1, new1, s)
  # 2 character substitutions
  old2 <- c("œ", "ß", "æ", "ø")
  new2 <- c("oe", "ss", "ae", "oe")
  s2 <- s1
  for(i in seq_along(old2)) s2 <- gsub(old2[i], new2[i], s2, fixed = TRUE)
  # other transformations: collapse punctuation and spaces, lowercase, trim
  s2 <- gsub("[[:punct:] ]+", " ", s2)
  s2 <- tolower(s2)
  s2 <- trim(s2)
  s2 <- rm_white(s2)
  return(s2)
}

Here is a minimal data set that reproduces the problem:

outgoing=structure(list(source = structure(c(1L, 1L, 1L), .Label = "YÖNETIM KURULU BASKANLIGI", class = "factor"), 
    target = structure(c(2L, 1L, 3L), .Label = c("x Yayincilik Reklam ve Organizasyon Hizmetleri", 
    "Suat", "Yavuz"), class = "factor")), .Names = c("source", 
"target"), row.names = c(NA, 3L), class = "data.frame")

When I call the function directly, it works:

wrangle_string("YÖNETİM KURULU BAŞKANLIĞI")

The result is:

 "yonetim kurulu baskanligi"

When I apply the function to a data frame column, it appears to work: checking the result with View(outgoing) shows no problem.

outgoing$source=as.vector(sapply(outgoing$source,wrangle_string))

However, when I check the cell with outgoing[1,1] I get this:

"yonetİm kurulu başkanliği"

How can I fix this problem?

Suat Atan PhD
  • Something doesn't seem right, please try to create a more [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that we can run and see the same result. Are you sure "source" is the first column? – MrFlick Oct 12 '17 at 14:30
  • I added the minimal data to reproduction. – Suat Atan PhD Oct 12 '17 at 14:44
  • I cannot reproduce the problem with the data you provided. `outgoing[1,1]` returns `"yonetim kurulu baskanligi"` as expected. – MrFlick Oct 12 '17 at 14:46
  • I used the `dput` function to provide data. However, the function also generates different string set. I want to get result as `"yonetim kurulu baskanligi"` but I am getting it as `"yonetİm kurulu başkanliği"` from my RStudio. When I use `dput` it generates string as `""yonetIm kurulu baskanligi""` . Tricky character is `İ` – Suat Atan PhD Oct 12 '17 at 14:52
  • Perhaps this is an encoding problem. Did you import your data with a special encoding? What does `Encoding(outgoing$source)` return and what locale is listed under `sessionInfo()`? What operating system are you using? – MrFlick Oct 12 '17 at 14:56
  • My encoding is utf-8. When I imported the data, the default encoding parameter of the read_csv function was utf-8. When I run `sessionInfo`, the result is: locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 // Also there is a screenshot showing the `View` function problem: https://imgur.com/a/iv9XA – Suat Atan PhD Oct 12 '17 at 15:02
  • What told you that the encoding is utf-8? Because that session info tells me you are probably running Windows and R uses the default "latin1" encoding on that OS. This means that the `old` and `new` variables in your function will use "latin1" encoding. You need to be very careful when matching up non ASCII range characters. – MrFlick Oct 12 '17 at 15:08

1 Answer


With MrFlick's help and guidance I found the answer. The problem stems from the locale settings: R was running with an English locale (English_United States.1252), but my data includes Turkish characters. To solve the problem I ran this command:

Sys.setlocale("LC_CTYPE", "turkish")

I also added the proper encoding parameter to the call that imports the CSV file:

outgoing <- read_delim("ebys_gidenevrak_rapor.csv", ";",
                       escape_double = FALSE, col_names = FALSE, trim_ws = TRUE,
                       locale = locale(encoding = "utf-8"))
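
To check whether the session locale or the string encoding is the culprit before changing anything, something like the following may help. It is only a sketch in base R, not part of the original fix: the variable names `x`, `x_utf8` and `cp` are made up for illustration, the test string is written with Unicode escapes so that the script does not depend on the encoding of the file it is saved in, and the locale string that comes back depends on the platform.

```r
# Which character-type locale is this session using? (Platform-dependent value.)
Sys.getlocale("LC_CTYPE")

# Hypothetical test string: "YÖNETİM KURULU BAŞKANLIĞI" written with escapes.
x <- "Y\u00d6NET\u0130M KURULU BA\u015eKANLI\u011eI"

# Declared encoding of the string: "UTF-8", "latin1", "bytes" or "unknown".
Encoding(x)

# enc2utf8() converts to UTF-8 and marks non-ASCII strings as such.
x_utf8 <- enc2utf8(x)

# Look at the code points to confirm that the dotted capital I (U+0130)
# is still present and was not silently replaced during import.
cp <- utf8ToInt(x_utf8)
any(cp == 0x0130L)
```

If `Encoding()` reports "unknown" or "latin1" for a column that is supposed to hold Turkish text, the data was probably read in the native encoding rather than UTF-8, which matches what MrFlick suspected in the comments.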
Suat Atan PhD