
I am new to R and am trying to remove meaningless words from a corpus. I have a dataframe with emails in one column and the target variable in another, and I am trying to clean the email body text. I have used the tm and qdap packages for this. I have already gone through most of the related questions and tried the approach from Remove meaningless words from corpus in R. The problem is that when I try to remove the unwanted tokens (those that are not dictionary words) from the corpus, I get an error.

library(qdap)
library(tm)

corpus = Corpus(VectorSource(Email$Body))
corpus = tm_map(corpus, content_transformer(tolower))  # wrap base functions in content_transformer so the corpus structure is preserved
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, stripWhitespace)

corpus = tm_map(corpus, stemDocument)

tdm = TermDocumentMatrix(corpus)
all_tokens = findFreqTerms(tdm,1)
tokens_to_remove = setdiff(all_tokens, GradyAugmented)
corpus <- tm_map(corpus, content_transformer(removeWords), tokens_to_remove)

Running the last line above produces the following error.

  invalid regular expression '(*UCP)\b(zyx|zyer|zxxxxxâ|zxxxxx|zwischenzeit|zwei|zvolen|zverejneni|zurã|zum|zstepswc|zquez|zprã|zorunlulu|zona|zoho|znis|zmir|zlf|zink|zierk|zhou|zhodnoteni|zgyã|zgã|zfs|zfbeswstat|zerust|zeroâ|zeppelinstr|zellerstrass|zeldir|zel|zdanska|zcfqc|zaventem|zarecka|zarardan|zaragoza|zaobchã|zamã|zakã|zaira|zahradnikova|zagorska|zagã|zachyti|zabih|zã|yusof|yukinobu|yui|ypg|ypaint|youtub|yoursid|youâ|yoshitada|yorkshir|yollayan|yokohama|yoganandam|yiewsley|yhlhjpz|yer|yeovil|yeni|yeatman|yazarina|yazaki|yaz|yasakt|yarm|yara|yannick|yanlislikla|yakar|yaiza|yabortslitem|yã|xxxxx|xxxxgbl|xuezi|xuefeng|xprn|xma|xlsx|xjchvnbbafeg|xiii|xii|xiaonan|xgb|xcede|wythenshaw|wys|wydzial|wydzia|wycomb|www|wuppert|wroclaw|wroc|wrightâ|wpisana|woustvil|wouldnâ|worthwhil|worsley|worri|worldwid|worldâ|workwear|worcestershir|worc|wootton|wooller|woodtec|woodsid|woodmansey|woodley|woodham|woodgat|wonâ|wolverhampton|wjodoyg|wjgfjiq|witti|witt|witkowski|wiss
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  :
  PCRE pattern compilation error
    'regular expression is too large'
    at ''
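
The error occurs because removeWords() pastes every word into a single alternation regex (the (*UCP)\b(word1|word2|...)\b pattern shown in the warning), and with this many tokens the compiled pattern exceeds PCRE's size limit. A common workaround is to remove the words in smaller batches so each regex stays small. A minimal sketch, with the batch size of 1000 an assumed value you may need to tune:

chunk_size <- 1000                                    # assumed batch size
chunks <- split(tokens_to_remove,
                ceiling(seq_along(tokens_to_remove) / chunk_size))
for (chunk in chunks) {
  # each call compiles a small regex instead of one huge one
  corpus <- tm_map(corpus, content_transformer(removeWords), chunk)
}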

A sample document from the corpus:

[794] "c mailto sent march ne rntbci accountspay nmuk subject new sig plc item still new statement await retriev use link connect account connect account link work copi past follow text address bar top internet browser https od datainterconnect com sigd sigdodsaccount php p thgqmdz d dt s contact credit control contact experi technic problem visit http bau faq datainterconnect com sig make payment call autom credit debit card payment line sig may abl help improv cashflow risk manag retent recoveri contract disput via www sigfinancetool co uk websit provid detail uniqu award win servic care select third parti avail sig custom power" 

tokens_to_remove[1:10]
 [1] "advis"        "appli"        "atlassian"    "bosch"        "boschrexroth" "busi"        
 [7] "communic"     "dcen"         "dcgbsom"      "email" 

I want to remove all words which are otherwise meaningless in English, such as c, mailto, ne, accountspay, nmuk, etc.

Koyeli

1 Answer


I would do it as follows:

library("readtext")
library(quanteda)
library(dplyr)
mytext <- c("Carles werwa went to sadaf buy trsfr in the supermanket",
            "Marta needs to werwa sadaf go to Jamaica")    # my corpus
tokens_to_remove <- c("werwa", "sadaf", "trsfr")           # my dictionary of unwanted tokens
TokenizedText <- tokens(mytext,
                        remove_punct = TRUE,
                        remove_numbers = TRUE)             # tokenize; you could also input an English dictionary
mytextClean <- lapply(TokenizedText, function(x) setdiff(x, tokens_to_remove))  # drop the unwanted tokens

mytextClean
$text1
[1] "Carles"      "went"        "to"          "buy"         "in"          "the"         "supermanket"

$text2
[1] "Marta"   "needs"   "to"      "go"      "Jamaica"

tokens_to_remove could also be an English dictionary; in that case, use intersect() instead of setdiff() so that only dictionary words are kept.
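
A minimal sketch of that variant, assuming qdap's GradyAugmented word list (mentioned in the comments) as the dictionary; note that GradyAugmented is lower-case, so the tokens may need lower-casing first:

library(qdap)                                   # provides the GradyAugmented word list
TokenizedText <- tokens_tolower(TokenizedText)  # GradyAugmented entries are lower-case
# keep only tokens that appear in the dictionary
mytextClean <- lapply(TokenizedText, function(x) intersect(x, GradyAugmented))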

Carles
  • My problem is that I have a dataframe of 1000 emails (Email$Body). So how can I change 'mytext' in your code to create tokens properly? – Koyeli Jul 02 '19 at 07:05
  • I hope my edit helps you out. I made the example for 2 texts. – Carles Jul 02 '19 at 07:22
  • Thanks, it works! I used the GradyAugmented dictionary for extracting tokens. Is that correct, or should I use another dictionary? Also, how can I make a DTM from this list of tokens? – Koyeli Jul 03 '19 at 03:55
  • I think the GradyAugmented dictionary works well, though I haven't used it nor checked the documentation. I hope this can help you out: https://stackoverflow.com/questions/56775324/is-there-a-way-to-loop-through-a-matrix-df-in-r-to-create-an-adjacency-matrix/56775712#56775712 – Carles Jul 03 '19 at 06:40
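
Following up on the DTM question in the comments, a minimal sketch: quanteda can rebuild a tokens object from the cleaned list and produce a document-feature matrix, its analogue of a DTM:

toks <- as.tokens(mytextClean)   # rebuild a quanteda tokens object from the list
dtm <- dfm(toks)                 # document-feature matrix (quanteda's DTM)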