Remove meaningless words from corpus in R

Question

I am using tm and wordcloud to do some basic text mining in R. The text being processed contains many meaningless words such as asfdg and aawptkr, and I need to filter them out. The closest solution I have found is to use library(qdapDictionaries) and build a custom function that checks whether a word is valid.

library(qdapDictionaries)
is.word  <- function(x) x %in% GradyAugmented

# example
> is.word("aapg")
[1] FALSE

The rest of the text-mining code is:

curDir <- "E:/folder1/"  # folder1 contains a.txt, b.txt
myCorpus <- VCorpus(DirSource(curDir))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)

myCorpus <- tm_map(myCorpus,foo) # foo clears meaningless words from corpus

The issue is that is.word() works fine on data frames, but how can I use it on a corpus?

Thanks

@s.brunel, `content_transformer` works with functions that modify the corpus; `is.word` just returns TRUE / FALSE – parth, Jun 02 '17 at 05:08
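One way to bridge that gap is to wrap the membership test in a function that returns text instead of TRUE / FALSE. The following is only a minimal sketch: `keep_valid_words` and `toy_dictionary` are hypothetical names introduced for illustration, and the toy dictionary stands in for `GradyAugmented` so that the example runs in base R.

```r
# Hypothetical helper: for each line of text, keep only the
# whitespace-separated words that appear in `dictionary`.
keep_valid_words <- function(text, dictionary) {
  vapply(strsplit(text, "\\s+"), function(words) {
    # compare in lowercase, but keep the original spelling of kept words
    paste(words[tolower(words) %in% dictionary], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

toy_dictionary <- c("the", "cat", "sat")  # stand-in for GradyAugmented
filtered <- keep_valid_words(c("The cat asfdg sat", "aawptkr"), toy_dictionary)
print(filtered)
```

Because the helper takes and returns a character vector, it should be usable with tm along the lines of `tm_map(myCorpus, content_transformer(keep_valid_words), GradyAugmented)`, assuming punctuation has already been removed as in the question.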

score 6 · Answer 1 · answered Jun 14 '17 at 22:20

If you are willing to try a different text mining package, then this will work:

library(readtext)
library(quanteda)
myCorpus <- corpus(readtext("E:/folder1/*.txt"))

# tokenize the corpus
myTokens <- tokens(myCorpus, remove_punct = TRUE, remove_numbers = TRUE)
# keep only the tokens found in an English dictionary
myTokens <- tokens_select(myTokens, names(data_int_syllables))

From there you can form a document-term matrix (called a "dfm" in quanteda) for analysis, and it will contain only the features matched as English-language terms in the dictionary (which contains about 130,000 words).
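As a rough illustration of that last step, the sketch below builds a dfm from a toy in-memory corpus. The two example texts and `toy_dictionary` are assumptions made for illustration; `toy_dictionary` stands in for `names(data_int_syllables)`.

```r
library(quanteda)

# Toy texts and dictionary (assumptions, not the asker's data)
toy_texts <- c(a = "the cat asfdg sat", b = "aawptkr the cat")
toy_dictionary <- c("the", "cat", "sat")

toks <- tokens(toy_texts, remove_punct = TRUE, remove_numbers = TRUE)
# keep only the tokens that match an entry in the dictionary
toks <- tokens_select(toks, toy_dictionary, selection = "keep")

# document-feature matrix containing only the dictionary words
myDfm <- dfm(toks)
print(featnames(myDfm))
```

The feature names of the resulting dfm can then be passed on to whatever analysis or plotting step follows.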

Thanks @Ken for the help. I looked into `quanteda` and reduced the data rather than considering all words, to speed up the process – parth, Jun 15 '17 at 04:18

score 2 · Accepted Answer · answered Jun 09 '17 at 16:58

Not sure if it will be the most resource-efficient method (I don't know the package very well), but it should work:

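library(tm)
library(qdapDictionaries)  # provides GradyAugmented

tdm <- TermDocumentMatrix(myCorpus)
# every term that occurs at least once in the corpus
all_tokens       <- findFreqTerms(tdm, 1)
# terms that are not in the dictionary
tokens_to_remove <- setdiff(all_tokens, GradyAugmented)
myCorpus <- tm_map(myCorpus, content_transformer(removeWords),
                   tokens_to_remove)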

answered Jun 09 '17 at 16:58

moodymudskipper


Thanks @Moody for the response, it filters out words to some extent. – parth Jun 13 '17 at 05:04
To some extent? Maybe make sure both sides are lowercase. – moodymudskipper Jun 13 '17 at 05:21
Yeah, thanks. I applied some transformations before these steps and it all works fine. The only issue is that it is resource-consuming. – parth Jun 15 '17 at 04:15
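The lowercase point can be sketched as follows, assuming a toy in-memory corpus and a toy dictionary in place of the files in E:/folder1/ and `GradyAugmented`. With tm's default settings the terms of a TermDocumentMatrix are lowercased, while `removeWords` matches case-sensitively, so lowercasing the corpus first keeps the two sides comparable.

```r
library(tm)

# Toy corpus and dictionary (assumptions made for illustration)
toy_dictionary <- c("the", "cat", "sat")
myCorpus <- VCorpus(VectorSource(c("The cat ASFDG sat", "aawptkr the cat")))

# lowercase the text so it matches the lowercased terms of the matrix
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

tdm <- TermDocumentMatrix(myCorpus)
tokens_to_remove <- setdiff(Terms(tdm), toy_dictionary)

# drop the non-dictionary terms, then collapse the gaps they leave behind
cleaned <- tm_map(myCorpus, removeWords, tokens_to_remove)
cleaned <- tm_map(cleaned, stripWhitespace)
cleaned_text <- unname(vapply(cleaned, as.character, character(1)))
print(cleaned_text)
```

Note that by default the matrix only contains terms of at least three characters, so shorter strings never appear in `tokens_to_remove` and are left in the text.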
