I am implementing an LDA topic model using the tm and topicmodels packages. Some of the documents contain odd characters that are not removed automatically (e.g. `docs <- tm_map(docs, removePunctuation)` does not remove them). When I read the .txt files into R, the Euro sign €, for example, shows up as a garbled character sequence. There are other odd characters that appear frequently throughout the corpus and need to be removed manually, so I use the following lines:

docs <- tm_map(docs, toSpace, "’")  
docs <- tm_map(docs, toSpace, "‐")  
docs <- tm_map(docs, toSpace, "–")  
docs <- tm_map(docs, toSpace, "€")  
docs <- tm_map(docs, toSpace, "’")

My problem is that once I close the R script and reopen it, these odd symbols change. One of the symbols now shows as ', and instead of ’ the script shows â???T. As a result, the symbols are no longer removed from the text after I reopen the script, and I have to manually change them back every time. My workaround is to keep these lines in a Word document and paste them into the R script each time I reopen it, which is very inefficient. Is there a way to save the R script so that these odd symbols are not lost after reopening? Or should I do something with my original .txt files instead? Thank you!
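
One possible way to make the script independent of how the editor saves it is to write the characters as Unicode escapes instead of pasting them literally. The sketch below uses only base R; the code points are my assumption about which characters are meant (U+2019, U+2010, U+2013, U+20AC), and `replace_with_space` is a hypothetical helper, not something from tm.

```r
# Assumed code points for the characters in the question:
# U+2019 right single quote, U+2010 hyphen, U+2013 en dash, U+20AC euro sign.
odd_chars <- c("\u2019", "\u2010", "\u2013", "\u20ac")

# Hypothetical helper: replace every literal occurrence of `pattern` with a space.
replace_with_space <- function(x, pattern) gsub(pattern, " ", x, fixed = TRUE)

# Sample text built from escapes, so the script file itself stays pure ASCII.
text <- "the firm\u2019s price \u2013 20\u20ac"

cleaned <- text
for (ch in odd_chars) {
  cleaned <- replace_with_space(cleaned, ch)
}
print(cleaned)
```

Because the script then contains only ASCII, reopening it under a different encoding should not alter the patterns. Assuming `toSpace` is a `content_transformer` wrapper around `gsub` (its definition is not shown here), the same escaped strings could presumably be passed to it, e.g. `tm_map(docs, toSpace, "\u2019")`.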

Michael
  • This is an encoding issue... When you defined your corpus, did you use `encoding='UTF-8'` as an argument? Here is more info: https://stackoverflow.com/questions/37278333/set-encoding-for-reading-text-files-into-tm-corpora – mysteRious Mar 31 '18 at 15:02
  • Also more at https://stackoverflow.com/questions/24920396/r-corpus-is-messing-up-my-utf-8-encoded-text – mysteRious Mar 31 '18 at 15:03
  • If using `RStudio`, try reloading the file with encoding and choose `UTF-8` – niko Mar 31 '18 at 15:45
  • It depends on whether you are using Windows or Unix, and tm, in my experience, is very troublesome with encodings – Elio Diaz Mar 31 '18 at 17:34
  • Thank you everyone for your help! It was indeed an encoding issue. I used `files <- DirSource(directory = inputdir, encoding = "UTF-8"); docs <- VCorpus(x = files)` to define the corpus and the odd symbols do not show up anymore, so I don't need to remove them. Now I am struggling with removing `’` at the end of multiple words... but I guess that is a different issue – Michael Mar 31 '18 at 21:58
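
The fix reported in the last comment comes down to declaring the file encoding at read time. A minimal base-R sketch of the same idea follows, assuming the files really are UTF-8; the temporary file is a stand-in for one corpus document, and `readLines` is used here instead of tm's `DirSource` only to keep the example free of package dependencies.

```r
# Write a small UTF-8 encoded file to a temporary path.
# charToRaw() on a string built from a \u escape yields its UTF-8 bytes.
path <- tempfile(fileext = ".txt")
writeBin(charToRaw("price: 20\u20ac"), path)

# Declare the encoding when reading, so the euro sign is not misinterpreted
# as several single-byte characters.
lines <- readLines(path, encoding = "UTF-8", warn = FALSE)

# Check that the euro sign survived the round trip as a single character.
has_euro <- grepl("\u20ac", lines, fixed = TRUE)
print(has_euro)

unlink(path)
```

If the check fails on real data, the files are probably not UTF-8 after all, and a different `encoding` value would be needed.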

0 Answers