
I am using tm() and wordcloud() for some basic data mining in R, but am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables).

Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:

Special
satisfação
Happy
Sad
Potential für

I then read my txt file into R:

words <- Corpus(DirSource("~/temp", encoding = "UTF-8"),readerControl = list(language = "lat"))

This yields the warning message:

Warning message:
In readLines(y, encoding = x$Encoding) :
  incomplete final line found on '/temp/file.txt'

But since it's a warning, not an error, I continue to push forward.

words <- tm_map(words, stripWhitespace)
words <- tm_map(words, tolower)

This then yields the error:

Error in FUN(X[[1L]], ...) : invalid input 'satisfa��o' in 'utf8towcs'

I'm open to finding ways to filter out the non-English characters either in TextWrangler or R; whatever is the most expedient. Thanks for your help!

roody
  • If the goal is just to remove those non-ASCII characters, then this will do the trick: `sapply(words, function(row) iconv(row, "latin1", "ASCII", sub=""))` [(from here)](http://stackoverflow.com/a/15754155/1036500). But this will leave you with word fragments with missing characters. If you want to remove non-English words, then you might subset words with non-ASCII characters, add them to your stopword list and remove them as you remove stopwords. – Ben Aug 09 '13 at 19:00
  • I'd actually seen that post, but it opens up the door to having to change the corpus into another object? Running this command on a corpus yields: `Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "c('matrix', 'character')"` – roody Aug 09 '13 at 19:03
  • You can convert the output of the `sapply` back into a corpus like so: `dat1 <- sapply(words, function(row) iconv(row, "latin1", "ASCII", sub=""))` then back to a corpus: `words1 <- Corpus(VectorSource(dat1))` – Ben Aug 09 '13 at 19:05
  • Also, you should be able to fix your first warning message by opening the txt file in any text editor and adding a few blank lines to the bottom, then saving, closing, and reloading in R – Ben Aug 09 '13 at 19:07
  • if you don't find an answer here, try the folks on the [Corpus Linguistics with R](https://groups.google.com/forum/#!forum/corpling-with-r) mailing list – drammock Aug 10 '13 at 22:24
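
The iconv recipe from the comments above can be collected into one runnable sketch. Note the comments used "latin1" as the source encoding; since the file was saved as UTF-8, "UTF-8" is assumed here instead, and a small in-line vector stands in for the real file:

```r
# stand-in for dat <- readLines("~/temp/file.txt")
dat <- c("Special", "satisfa\u00e7\u00e3o", "Happy", "Sad", "Potential f\u00fcr")

# drop every non-ASCII character; sub = "" deletes rather than substituting
dat1 <- iconv(dat, "UTF-8", "ASCII", sub = "")

# rebuild a corpus from the cleaned vector, as suggested in the comments:
# words1 <- Corpus(VectorSource(dat1))   # requires library(tm)
```

As Ben notes, this leaves fragments such as "satisfao" behind rather than dropping the affected words entirely.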

2 Answers


Here's a method to remove words with non-ASCII characters before making a corpus:

# remove words with non-ASCII characters
# assuming you read your txt file in as a vector, eg. 
# dat <- readLines('~/temp/dat.txt')
dat <- "Special,  satisfação, Happy, Sad, Potential, für"
# convert string to vector of words
dat2 <- unlist(strsplit(dat, split=", "))
# find indices of words with non-ASCII characters: sub="dat2" replaces each
# unconvertible character with the marker "dat2", so grep picks out exactly
# the words that contained non-ASCII characters
dat3 <- grep("dat2", iconv(dat2, "latin1", "ASCII", sub="dat2"))
# subset original vector of words to exclude words with non-ASCII char
dat4 <- dat2[-dat3]
# convert vector back to a string
dat5 <- paste(dat4, collapse = ", ")
# make corpus
require(tm)
words1 <- Corpus(VectorSource(dat5))
inspect(words1)

A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
Special, Happy, Sad, Potential
Ben

You can also use the package "stringi".

Using the above example:

library(stringi)
dat <- "Special,  satisfação, Happy, Sad, Potential, für"
stringi::stri_trans_general(dat, "latin-ascii")

Output:

[1] "Special,  satisfacao, Happy, Sad, Potential, fur"  
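
Because stri_trans_general transliterates accented characters to their ASCII counterparts (ç → c, ü → u) instead of deleting them, whole words survive, and the cleaned text should no longer trip the utf8towcs error when lowercased. A minimal sketch (base R's tolower stands in for the tm_map(words, tolower) step from the question):

```r
library(stringi)

dat <- "Special,  satisfa\u00e7\u00e3o, Happy, Sad, Potential, f\u00fcr"

# transliterate accented characters to their closest ASCII equivalents
clean <- stri_trans_general(dat, "latin-ascii")

# the result is pure ASCII, so lowercasing works without error
lower <- tolower(clean)
```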
karel
Wilfredo