I created a corpus in R using the tm package, specifying the language and encoding as follows:

de_DE.corpus <- Corpus(VectorSource(de_DE.sample),
                       readerControl = list(language = "de_DE", encoding = "UTF_8"))
de_DE.corpus[36]$content
de_DE.dtm <- DocumentTermMatrix(de_DE.corpus, control = list(encoding = "UTF-8"))
inspect(de_DE.dtm[, grepl("grÃ", de_DE.dtm$dimnames$Terms)])
inspect(de_DE.dtm[36, ])

When I look at the content of document 36 via de_DE.corpus[36]$content, which contains 'ü', the text is displayed correctly, e.g. " ...Single ist so die Begründung der Behörde Eine... ".

But when I create the DocumentTermMatrix (I tried multiple options for encoding and language), I get terms like "begrÃ" where the word should be, for example, "Begründung". Here is the result of executing inspect(de_DE.dtm[36, ]):

<<DocumentTermMatrix (documents: 1, terms: 21744)>>
Non-/sparse entries: 102/21642
Sparsity           : 100%
Maximal term length: 43
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs begrà das dem der die eine einen jobcenter und zum
  36     3   4   2   4   8    2     2         4   3   3

I would appreciate it if someone could tell me how to fix the problem. Thanks in advance :)

Jonathan Hall
  • Which Operating System are you on? – knb Aug 11 '17 at 14:41
  • Windows 10, R Version 3.4.1, package ‘tm’ version 0.7-1 – Sandra Meneses Aug 17 '17 at 12:43
  • I don't know what's going on, but here's a potential clue: `text <- "Begründung"; Encoding(text) ## [1] "UTF-8"` Here's what happens if we set the wrong encoding: `Encoding(text) <- "latin1"; print(text) ## [1] "BegrÃ¼ndung"` – Patrick Perry Oct 12 '17 at 23:02
  • After many failed attempts, the only solution I found was to convert the corpus from UTF-8 to latin1 before building the matrix: `de_DE.corpus <- Corpus(VectorSource(de_DE.sample), readerControl = list(language="de_DE",encoding = "UTF_8"))` `de_DE.corpus <- tm_map(de_DE.corpus, function(x) iconv(x, from='UTF-8', to="latin1"))` `de_DE.corpus[4]$content` `de_DE.dtm <- DocumentTermMatrix(de_DE.corpus,control = list (encoding = 'UTF-8'))` `inspect(de_DE.dtm[4, ])` I hope it helps someone with the same issue. – Sandra Meneses Nov 13 '17 at 23:22
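
The clue and the workaround in the comments above fit one common failure mode, sketched here in base R only (no tm). It is an assumption that this is what happened to the asker's data: UTF-8 bytes get treated as latin1 and are then converted to UTF-8 a second time. The variable names are hypothetical, and `\u` escapes are used so the example does not depend on the encoding of the script file.

```r
# A correctly encoded UTF-8 string ("Begründung").
original <- "Begr\u00fcndung"

# Mislabel the same bytes as latin1 (the "wrong encoding" from the comment).
mislabeled <- original
Encoding(mislabeled) <- "latin1"

# Converting the mislabeled string to UTF-8 encodes each byte of "ü" again,
# giving the familiar two-character garbage in place of the umlaut.
mojibake <- iconv(mislabeled, from = "latin1", to = "UTF-8")
print(nchar(original))   # 10 characters
print(nchar(mojibake))   # 11 characters: "ü" became two characters

# Undo one layer: convert UTF-8 back to latin1 bytes, then declare those
# bytes to be what they really are, namely UTF-8.
repaired <- iconv(mojibake, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"
print(identical(utf8ToInt(repaired), utf8ToInt(original)))
```

If the input really is double-encoded in this way, that would explain why a UTF-8 to latin1 conversion appears to fix it. On data that is already correct, the same conversion would produce plain latin1 text instead.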

1 Answer

Can you check your input data? Your code works for me, so I think the problem is already present when the data is loaded into de_DE.sample.

doc <- c("Single ist so die Begründung der Behörde Eine", "Single Begründung Behörde ")

de_DE.corpus <- Corpus(VectorSource(doc),
                       readerControl = list(language = "de_DE", encoding = "UTF_8"))
de_DE.dtm <- DocumentTermMatrix(de_DE.corpus, control = list(encoding = "UTF-8"))

inspect(de_DE.dtm[1, ])
<<DocumentTermMatrix (documents: 1, terms: 7)>>
Non-/sparse entries: 7/0
Sparsity           : 0%
Maximal term length: 10
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs begründung behörde der die eine ist single
   1          1       1   1   1    1   1      1
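
One way to check the input vector before building the corpus is sketched below. It is a heuristic and not part of tm: `looks_double_encoded` is a hypothetical helper that flags strings containing "Ã" (U+00C3), which is rare in German text but typical of UTF-8 that was decoded as latin1. The two sample strings are made up for illustration.

```r
# Hypothetical sample: the first element is clean, the second is double-encoded.
sample_docs <- c(
  "Single ist so die Begr\u00fcndung der Beh\u00f6rde",
  "die Begr\u00c3\u00bcndung der Beh\u00c3\u00b6rde"
)

# Heuristic check: does the string contain U+00C3?
looks_double_encoded <- function(x) {
  grepl("\u00c3", x, fixed = TRUE)
}

# Double-encoded text is still valid UTF-8, so validUTF8() alone does not
# catch it. The same goes for Encoding(), which only reports the declared mark.
print(validUTF8(sample_docs))
print(Encoding(sample_docs))

flagged <- looks_double_encoded(sample_docs)
print(which(flagged))    # indices of suspicious documents
```

Running the helper over the whole of de_DE.sample would show whether the garbage is already present before tm touches the data.
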
Dr VComas
  • Hi @Dr Vcomas, thanks for your reply. You're right, the problem is already in de_DE.sample. When I check the encoding of the input data with `Encoding(de_DE.sample[36])`, it shows "UTF-8", but when I apply `iconv(de_DE.sample[36], to='UTF-8')`, the characters come out as "..er single ist so die begrÃ¼ndung der behÃ¶rde". I don't understand why a transformation is applied if UTF-8 is detected as the encoding, or how I could load the data correctly. I hope this additional information gives someone an idea of how to solve the issue. :) – Sandra Meneses Aug 17 '17 at 12:47
  • Encoding issues are quite common. You will need to check the whole process: where the data comes from, and whether there is a step where it is saved with a particular encoding. For example, people often extract the data and open it in Excel, which in my experience usually introduces encoding issues. Check every step of your data pipeline. I hope this helps; you can still consider the question answered. It is not a tm or DocumentTermMatrix issue. – Dr VComas Aug 17 '17 at 13:35
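
As a minimal sketch of one such step, the following writes a UTF-8 file and reads it back while declaring the encoding explicitly. How de_DE.sample was actually loaded is not shown in the thread, so the temporary file and the variable names are assumptions made for illustration.

```r
# Text with umlauts, written via \u escapes so the script encoding is irrelevant.
original <- "Begr\u00fcndung der Beh\u00f6rde"

# Write the raw UTF-8 bytes to a temporary file (binary mode, no re-encoding).
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeLines(enc2utf8(original), con, useBytes = TRUE)
close(con)

# Read the file back and tell R that the bytes are UTF-8. Without this, on a
# system whose native encoding is not UTF-8, the bytes may later be
# interpreted in the native encoding.
lines <- readLines(path, encoding = "UTF-8")

print(Encoding(lines))
print(identical(utf8ToInt(lines), utf8ToInt(original)))

unlink(path)
```

Declaring the encoding at the point where the file is read tends to be simpler than repairing the strings afterwards.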