I am implementing an LDA topic model using the tm and topicmodels packages. Some of the documents contain odd characters that are not removed automatically (e.g. docs <- tm_map(docs, removePunctuation) does not remove ’). When I read the .txt files into R, the Euro sign €, for example, shows up as a garbled sequence of characters. Other odd characters show up frequently throughout the corpus and need to be removed manually, so I use the following lines:
docs <- tm_map(docs, toSpace, "’")
docs <- tm_map(docs, toSpace, "‐")
docs <- tm_map(docs, toSpace, "–")
docs <- tm_map(docs, toSpace, "€")
docs <- tm_map(docs, toSpace, "’")
My problem is that once I close the R script and reopen it, these odd symbols change: instead of ’ the script shows ', and instead of ’ it shows â???T. As a result, the symbols are no longer removed from the text, and I have to manually change them back to what I need every time the script is reopened. My current workaround is to keep these lines in a Word document and paste them into the R script each time I reopen it, which is very inefficient. Is there a way to save the R script so that these odd symbols are not lost after reopening? Or should I do something with my original .txt files instead? Thank you!
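For illustration, one possible way to keep the patterns stable is to write them as Unicode escapes, so the script file itself contains only ASCII and does not depend on the encoding the editor uses when saving or reopening it. This is only a sketch under assumptions: it guesses that the characters in question are the right single quotation mark, the hyphen, the en dash and the Euro sign, and `replace_with_space`, `odd_chars` and `clean_text` are hypothetical names that are not part of my script. The `toSpace` function above is assumed to behave like the `content_transformer` wrapper shown at the end.

```r
# Hypothetical helper: replace every literal occurrence of `pattern` with a space.
replace_with_space <- function(x, pattern) gsub(pattern, " ", x, fixed = TRUE)

# Unicode escapes keep the script file pure ASCII, so the editor's
# encoding setting cannot alter these patterns when the file is reopened.
# Which characters actually occur in the corpus is an assumption here.
odd_chars <- c(
  "\u2019",  # right single quotation mark
  "\u2010",  # hyphen
  "\u2013",  # en dash
  "\u20ac"   # euro sign
)

# Apply every replacement in turn to a character vector.
clean_text <- function(x) {
  for (ch in odd_chars) x <- replace_with_space(x, ch)
  x
}

sample_text <- "It\u2019s 5\u20ac \u2013 cheap"
cleaned <- clean_text(sample_text)
print(cleaned)

# With tm installed, the same helper could be applied to a corpus.
if (requireNamespace("tm", quietly = TRUE)) {
  docs <- tm::VCorpus(tm::VectorSource(sample_text))
  to_space <- tm::content_transformer(replace_with_space)
  for (ch in odd_chars) docs <- tm::tm_map(docs, to_space, ch)
}
```

Whether this removes the characters from the real corpus would also depend on the .txt files being read with the encoding they were actually saved in, which I have not verified.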