I created a corpus in R using the tm package, specifying the language and encoding as follows:
de_DE.corpus <- Corpus(VectorSource(de_DE.sample),
                       readerControl = list(language = "de_DE", encoding = "UTF_8"))
de_DE.corpus[36]$content
de_DE.dtm <- DocumentTermMatrix(de_DE.corpus, control = list(encoding = 'UTF-8'))
inspect(de_DE.dtm[, grepl("grÃ", de_DE.dtm$dimnames$Terms)])
inspect(de_DE.dtm[36, ])
When I look at the content of document 36 with de_DE.corpus[36]$content, which contains 'ü', the text is displayed correctly, e.g. " ...Single ist so die Begründung der Behörde Eine... "
But when I create the DocumentTermMatrix (I tried multiple options for encoding and language), I get terms like "begrÃ" where the word should be, for example, "Begründung". This is the result of executing inspect(de_DE.dtm[36, ]):
<<DocumentTermMatrix (documents: 1, terms: 21744)>>
Non-/sparse entries: 102/21642
Sparsity : 100%
Maximal term length: 43
Weighting : term frequency (tf)
Sample :
    Terms
Docs begrà das dem der die eine einen jobcenter und zum
  36     3   4   2   4   8    2     2         4   3   3
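In case it helps narrow this down, here is a minimal base-R sketch of what I suspect is happening. It does not use tm at all. It assumes (I have not confirmed this) that the UTF-8 bytes of 'ü' are re-read as Latin-1 somewhere in the pipeline, and that the tokenizer then splits the word at the resulting '¼' character. The names word, garbled and pieces are only for illustration.

```r
# Build "Begründung" with a Unicode escape so the script itself is encoding-safe.
word <- enc2utf8("Begr\u00fcndung")

# Assumption: some step treats the UTF-8 bytes as Latin-1.
# iconv() reinterprets the bytes C3 BC (UTF-8 for "ü") as the two
# Latin-1 characters "Ã" (U+00C3) and "¼" (U+00BC).
garbled <- iconv(word, from = "latin1", to = "UTF-8")
print(garbled)  # "BegrÃ¼ndung"

# Assumption: the tokenizer treats "¼" (U+00BC) as a non-word character.
# Splitting there leaves "BegrÃ" as the first piece, which would match
# the truncated term in the matrix once it is lower-cased.
pieces <- strsplit(garbled, "\u00bc", fixed = TRUE)[[1]]
print(pieces)

# The declared encoding of the original string, for comparison.
print(Encoding(word))
```

If this is the mechanism, running Encoding() and Sys.getlocale() on the real data should show where the declared encoding and the actual bytes disagree.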
I would appreciate it if someone could tell me how to fix this problem. Thanks in advance :)