Odd symbols in R script lost after reloading

Question

I am implementing an LDA topic model using the tm and topicmodels packages. Some of the documents contain odd characters that are not removed automatically (e.g. `docs <- tm_map(docs, removePunctuation)` does not remove ’). When I read the .txt files into R, the Euro sign €, for example, shows up as â‚¬. Other odd characters appear frequently throughout the corpus and need to be removed manually, so I use the following lines:

docs <- tm_map(docs, toSpace, "’")  
docs <- tm_map(docs, toSpace, "‐")  
docs <- tm_map(docs, toSpace, "–")  
docs <- tm_map(docs, toSpace, "â‚¬")  
docs <- tm_map(docs, toSpace, "â€™")

My problem is that once I close the R script and reopen it, these odd symbols change. Instead of ’ the script shows ', and instead of â€™ it shows â???T. As a result, the symbols are no longer removed from the text, and I have to change them back by hand every time the script is reopened. As a workaround, I keep these lines in a Word document and paste them into the R script each time I reopen it, which is very inefficient. Is there a way to save the R script so that these odd symbols are not lost after reopening? Or should I do something with my original .txt files instead? Thank you!
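One possible way to keep such lines stable across reloads is to write the characters as Unicode escapes, so the script file contains only ASCII and does not depend on the encoding the editor uses when saving or reopening it. The following is a minimal sketch in base R, not a confirmed fix for this corpus: `to_space` and `odd_chars` are hypothetical names, and the code point U+2010 for the hyphen is an assumption about which character the original line contains.

```r
# Hypothetical helper: replace every listed character with a space.
to_space <- function(x, chars) {
  for (ch in chars) {
    # fixed = TRUE matches the character literally, not as a regex
    x <- gsub(ch, " ", x, fixed = TRUE)
  }
  x
}

# Unicode escapes keep the script itself pure ASCII,
# so reopening it in another encoding cannot alter them.
odd_chars <- c(
  "\u2019",  # right single quotation mark
  "\u2010",  # hyphen (assumed code point)
  "\u2013",  # en dash
  "\u20ac"   # euro sign
)

sample_text <- "the firm\u2019s price\u2013range is 5\u20ac"
cleaned <- to_space(sample_text, odd_chars)
print(cleaned)
```

The same escapes can presumably be passed to the existing calls, e.g. `tm_map(docs, toSpace, "\u2019")`, assuming `toSpace` is a `content_transformer` wrapper around `gsub`.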

This is an encoding issue... when you defined your corpus, did you use `encoding='UTF-8'` as an argument? Here is more info https://stackoverflow.com/questions/37278333/set-encoding-for-reading-text-files-into-tm-corpora — mysteRious, Mar 31 '18 at 15:02
Also more at https://stackoverflow.com/questions/24920396/r-corpus-is-messing-up-my-utf-8-encoded-text — mysteRious, Mar 31 '18 at 15:03
If using `RStudio`, try reload with encoding and choose `UTF-8` — niko, Mar 31 '18 at 15:45
it depends on whether you are using windows or unix, and tm, in my experience, is very troublesome with encodings — Elio Diaz, Mar 31 '18 at 17:34
Thank you everyone for your help! It was indeed an encoding issue. I used `files <- DirSource(directory = inputdir,encoding ="UTF-8" ) docs<- VCorpus(x=files)` to define the corpus and the odd symbols do not show up anymore, so I don't need to remove them. Now I am struggling with removing `’` at the end of multiple words... but I guess that is a different issue — Michael, Mar 31 '18 at 21:58
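For the remaining trailing `’` mentioned in the last comment, a rough sketch in base R is shown here. It reads a file with an explicit UTF-8 encoding and removes the mark only when it ends a word, using a lookahead so that apostrophes inside words are kept. The temporary file and the names `tmp`, `lines`, and `cleaned` are illustrative assumptions; with tm the same pattern would need to be wrapped in a `content_transformer`, which is not shown.

```r
# Write a small UTF-8 test file (stands in for one of the .txt documents).
tmp <- tempfile(fileext = ".txt")
writeLines("firm\u2019s prices\u2019 in \u20ac", tmp, useBytes = TRUE)

# Declare the encoding when reading, so the euro sign is not garbled.
lines <- readLines(tmp, encoding = "UTF-8")

# Drop U+2019 only when it is followed by whitespace or the end of the line;
# the lookahead leaves word-internal apostrophes (firm's) untouched.
cleaned <- gsub("\u2019(?=\\s|$)", "", lines, perl = TRUE)
print(cleaned)

unlink(tmp)
```

Whether word-internal apostrophes should also be stripped depends on the later tokenization step; this sketch deliberately keeps them.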
