I have been using the stringi package for a while now and everything has worked fine.

I recently wanted to put some regex inside a function and store that function in a separate file. The code works fine when the function is defined directly in the script, but when the function is sourced from the separate file I do not get the expected result.

Here is the code to reproduce the issue:

library(stringi)

clean <- function(text){
  stri_replace_all_regex(str = text, 
                         pattern = "(?i)[^a-zàâçéèêëîïôûùüÿñæœ0-9,\\.\\?!']",
                         replacement = " ")
}
text <- "A sample text with some french accent é, è, â, û and some special characters |, [, ( that needs to be cleaned."
clean(text) # OK
[1] "A sample text with some french accent é, è, â, û and some special characters  ,  ,   that needs to be cleaned."
source("clean.r") # same function, now loaded from a separate file
clean(text) # KO
[1] "A sample text with some french accent  ,  ,  ,   and some special characters  ,  ,   that needs to be cleaned."

I want to remove everything that is not a letter (accented or not), a digit, or one of the punctuation characters ?, !, the comma, the period and the apostrophe.

The code works just fine if the function is loaded inside the script directly. If it is sourced then it gives a different result.

I also tried using stringr and I have the same problem. My files are saved in UTF-8 encoding.

I do not understand why this is happening. Any help is greatly appreciated.

Thank you.

R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringi_1.1.5     data.table_1.10.4

loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1    yaml_2.1.14 
maRmat
I am having similar problems. I think it has to do with how the special characters are read when the file is sourced. My text contains "£" and yours contains French accents "é, è, â, û". – psychonomics Aug 03 '18 at 14:56
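
One way to check this hypothesis, as a minimal sketch: write a one-line UTF-8 file to a temporary location and source it with and without an explicit encoding. The file name accent_demo.r and the variables accent, explicit and implicit are made up for illustration. source() does take an encoding argument; without it, the file is assumed to be in the session's native encoding, which is CP1252 in the locale shown in the question.

```r
# Hypothetical one-line script, written out as UTF-8 bytes.
path <- file.path(tempdir(), "accent_demo.r")
writeLines(enc2utf8('accent <- "\u00e9"'), path, useBytes = TRUE)

# Declare the file encoding explicitly instead of relying on the locale.
source(path, encoding = "UTF-8")
explicit <- enc2utf8(accent)

# Without the encoding argument, source() assumes the native encoding.
# In a CP1252 session the two UTF-8 bytes of the accent are then read as
# two separate characters; in a UTF-8 session both results should agree.
source(path)
implicit <- accent

nchar(explicit)
nchar(implicit)
```

If the two character counts differ, the sourced file is being decoded with the wrong encoding, which would also corrupt the accented characters inside a regex pattern.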

1 Answer

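Try converting the text to ASCII first. This transliterates the accented characters to their closest ASCII equivalents, so the text itself changes, but it may give the same behaviour whether the function is defined in the script or sourced from a file.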

+1 to Felipe Alvarenga https://stackoverflow.com/a/45941699/2069472

text <- "Ábcdêãçoàúü"
iconv(text, to = "ASCII//TRANSLIT")
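
If the accents need to be kept rather than transliterated, an alternative sketch is to write the pattern with \u escapes, so that the sourced file contains only ASCII characters and no longer depends on how R decodes it. This assumes the cause is the one suggested in the comment on the question (the file being read in the session's native encoding). The helper variable keep is introduced here for illustration only; the escapes list the same accented letters as the original pattern.

```r
library(stringi)

# Characters to keep, written with \u escapes so this file is pure ASCII:
# a-z, digits, , . ? ! ' and the accented letters from the original pattern
# (a-grave, a-circumflex, c-cedilla, e-acute, ... ae and oe ligatures).
keep <- paste0("a-z0-9,\\.\\?!'",
               "\u00e0\u00e2\u00e7\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef",
               "\u00f4\u00fb\u00f9\u00fc\u00ff\u00f1\u00e6\u0153")

clean <- function(text) {
  # Replace every character outside the kept set with a space,
  # ignoring case as in the original pattern.
  stri_replace_all_regex(str = text,
                         pattern = paste0("(?i)[^", keep, "]"),
                         replacement = " ")
}

clean("caf\u00e9 | test")
```

Another option that leaves clean.r unchanged is to declare the file encoding when sourcing it, for example source("clean.r", encoding = "UTF-8").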
psychonomics