
Let me start by saying that I'm still pretty much a beginner with R. I am currently trying out basic text mining techniques on Turkish texts using the tm package. However, I have run into a problem with the display of Turkish characters in R.

Here's what I did:

docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")

My thinking was that setting the language to Turkish and the encoding to UTF-8 (the original encoding of the text files) should make it possible to display the Turkish characters İ, ı, ğ, Ğ, ş and Ş. Instead, the output converts these characters to I, i, g, G, s and S respectively, and the file is saved in ANSI encoding, which cannot represent these characters.

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))

also saves the file without the characters in ANSI encoding.

This seems to not only be an issue with the output file.

writeLines(as.character(docs[[1]]))

for example, yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi".

After reading "UTF-8 file output in R", I also tried the following code:

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)

which didn't change the results.

All of this is on Windows 7 with the most recent versions of both R and RStudio.

Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.

Brinxian
  • Can you use `write.table(..., fileEncoding = "UTF-8")` instead? – Mako212 Dec 22 '17 at 16:55
  • Thanks for the reply. Sadly, this yields the same results. The example line now reads "\", \"\", \"Okul ve cami açilislari umutlari artirdi\", \"\", \"". Exact code used (taken from here: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/write.table.html): `write.table(as.character(docs), file = "documents.txt", append = FALSE, quote = TRUE, sep = " ", eol = "\n", na = "NA", dec = ".", row.names = TRUE, col.names = TRUE, qmethod = c("escape", "double"), fileEncoding = "UTF-8")` – Brinxian Dec 22 '17 at 17:27

1 Answer


Here is how I keep the Turkish characters intact:

  1. Open a new .Rmd file in RStudio. (RStudio -> File -> New File -> R Markdown)
  2. Copy and paste your text containing Turkish characters.
  3. Save the .Rmd file with encoding. (RStudio -> File -> Save with Encoding.. -> UTF-8)
  4. Read the file back in: `yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")`
  5. Collapse the lines into a single string: `yourdocument <- paste(yourdocument, collapse = " ")`
  6. After this step you can create your corpus, e.g. starting from VectorSource() in the tm package.
  7. Turkish characters will appear as they should.
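
The steps above can be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the temporary file paths and the sample sentence are placeholders, steps 1 to 3 (saving from RStudio) are replaced by writing a small UTF-8 file from code, and the tm part runs only if the package is installed. Writing the result back out with `enc2utf8()` and `useBytes = TRUE` is an addition that is not part of the steps above; it is a commonly used way to stop `writeLines()` from re-encoding the text to the native Windows code page, but I have not verified it on the exact setup in the question.

```r
# Placeholder paths (assumptions for this sketch, not from the steps above)
path     <- file.path(tempdir(), "yourdocument.Rmd")
out_path <- file.path(tempdir(), "documents.txt")

# Stand-in for steps 1-3: create a UTF-8 file with Turkish characters.
# \u escapes keep this script ASCII-only; R marks the string as UTF-8.
sample_text <- "Okul ve cami a\u00e7\u0131l\u0131\u015flar\u0131 umutlar\u0131 art\u0131rd\u0131"
writeLines(enc2utf8(sample_text), path, useBytes = TRUE)

# Steps 4-5: read the lines as UTF-8 and collapse them into one string
yourdocument <- readLines(path, encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")

# Steps 6-7: build the corpus from the character vector (only if tm is installed)
if (requireNamespace("tm", quietly = TRUE)) {
  docs <- tm::VCorpus(tm::VectorSource(yourdocument))
  text <- as.character(docs[[1]])
} else {
  text <- yourdocument
}

# Write the UTF-8 bytes as they are, without conversion to the native encoding
writeLines(enc2utf8(text), out_path, useBytes = TRUE)

# Read the output back to check that the characters survived
round_trip <- readLines(out_path, encoding = "UTF-8")
```

Note that in `writeLines()` the third positional argument is `sep`, not an encoding, so any encoding handling has to happen through the connection or through `useBytes` as above.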
dataatomic