How to remove non UTF-8 characters from text

Question

I need help removing non UTF-8 characters from my word cloud. My code so far is below. I've tried gsub and removeWords, but the characters still show up in the word cloud and I don't know how to get rid of them. Any help would be appreciated. Thank you for your time.

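library(tm)            # VCorpus, tm_map, removeWords, TermDocumentMatrix
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal

txt <- readLines("11-0.txt")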
corpus = VCorpus(VectorSource(txt))
gsub("â€™","â€˜","",txt)

corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace) 
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","â€","project"))

tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)

wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))

Edit: Here is my iconv version

txt <- readLines("11-0.txt")
Encoding(txt) <- "latin1"
iconv(txt, "latin1", "ASCII", sub="")

corpus = VCorpus(VectorSource(txt))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace) 
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","project"))

tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)

wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))
title(main="Alice in Wonderland word cloud",font.main=1,cex.main =1.5)

dario · Accepted Answer · 2020-02-17T10:24:27.710

The signature of gsub is:

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

Not sure what you wanted to do with

gsub("â€™","â€˜","",txt)

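but that line is probably not doing what you want it to do. The arguments are matched by position, so "â€™" is the pattern, "â€˜" is the replacement, "" is the string being searched (x), and txt ends up in the ignore.case slot. The result is also never assigned to anything.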

There is a previous SO question on gsub and non-ASCII symbols.
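
For reference, a minimal sketch of a gsub call with the arguments in the documented order and the result assigned. The sample strings are made up for illustration, and the stray sequences are written as `\u` escapes so the snippet does not depend on how the script file itself is encoded. It assumes the unwanted characters are exactly "â€™" and "â€˜".

```r
# Hypothetical sample lines containing the stray sequences from the question.
# "\u00e2\u20ac\u2122" spells "â€™" and "\u00e2\u20ac\u02dc" spells "â€˜".
txt <- c("Alice\u00e2\u20ac\u2122s cat",
         "\u00e2\u20ac\u02dcquoted\u00e2\u20ac\u2122")

# Pattern first, replacement second, the text to search third.
# The alternation matches either sequence.
pattern <- "\u00e2\u20ac\u2122|\u00e2\u20ac\u02dc"

# Assign the result; gsub does not modify txt itself.
cleaned <- gsub(pattern, "", txt)

print(cleaned)
```

This only removes the sequences listed in the pattern, so any other stray character would need its own entry.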

Edit:

Suggested solution using iconv:

Removing all non-ASCII characters:

txt <- "â€™xxxâ€˜"

iconv(txt, "latin1", "ASCII", sub="")

Returns:

[1] "xxx"
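
The same call can be applied to a whole character vector, with the result assigned so that later steps see the cleaned text. This is a sketch on made-up lines, not the asker's file. Converting to ASCII also drops legitimate accented letters, so it is only appropriate when plain ASCII output is acceptable.

```r
# Made-up lines standing in for readLines() output.
# The escapes spell "â€™" and "â€˜" around "xxx".
txt <- c("\u00e2\u20ac\u2122xxx\u00e2\u20ac\u02dc", "plain ascii line")

# Treat the input as latin1 bytes and drop everything without an ASCII equivalent.
txt_ascii <- iconv(txt, "latin1", "ASCII", sub = "")

print(txt_ascii)

# Sanity check: TRUE when no character outside the ASCII range remains.
all_ascii <- !any(grepl("[^\\x01-\\x7F]", txt_ascii, perl = TRUE))
print(all_ascii)
```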

edited Feb 17 '20 at 10:24

answered Feb 17 '20 at 09:56


Yeah, I'm not so sure about my gsub line because I don't fully understand it. What can I change in my gsub call to make it right? – warmsoda Feb 17 '20 at 09:59
First, you don't assign the result of `gsub` to anything, i.e. you don't use it at all! Second, the first three arguments to `gsub` are `pattern`, `replacement` and `x`, with `pattern` being the regex pattern you want **to replace**, `replacement` being the string you want to **use instead** (i.e. the replacement) and `x` being the string where we want to do the substitution. Your code `gsub("â€™","â€˜","",txt)` has `"â€™"` as the regex (and is probably **not** valid regex), `"â€˜"` as the replacement string and `""` (the empty string) as the string where we want to do the replacement. – dario Feb 17 '20 at 10:08
So what would a valid regex for that be? I'm using gsub("â€","",txt) and the ™ and ˜ are still left over. What can I do about it? – warmsoda Feb 17 '20 at 10:13
I've tried that, however, the result is still the same. Can you check my code above please. – warmsoda Feb 17 '20 at 10:28
You have to assign the result from `iconv`. For example: `txt <- iconv(txt, "latin1", "ASCII", sub="")` – dario Feb 17 '20 at 10:31
Thank you, that works wonders for me. Can you explain the logic behind why I have to assign that? Sorry, I am kind of new to all this. – warmsoda Feb 17 '20 at 10:38
In most cases, **R** functions return a **new** object (i.e. they do **not** mutate one of the arguments in place). In your example `iconv` does not change `txt`, but returns a completely new object. If you do not assign it, the result is discarded. (In **R** we use `<-` for assignments; although `=` often works, it's better to get used to using `<-`, and sometimes the distinction between `<-` and `=` is crucial!) – dario Feb 17 '20 at 10:46
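
A small sketch of the point about return values, using a made-up string: calling `iconv` without assignment leaves the variable untouched, and only the assigned call changes what `txt` holds.

```r
# Hypothetical input containing one non-ASCII letter (e with acute accent).
txt <- "caf\u00e9 au lait"

# The result is computed (and auto-printed at top level) but not stored.
iconv(txt, "UTF-8", "ASCII", sub = "")

# txt still holds the original string.
unchanged <- identical(txt, "caf\u00e9 au lait")
print(unchanged)

# Assign the result to keep it.
txt <- iconv(txt, "UTF-8", "ASCII", sub = "")
print(txt)
```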
