
Hi there: I'm using the tm package for some text analysis, and I need to substitute each term in a vector of patterns with its paired term in a vector of replacements. The pattern/replacement dictionary looks like this:

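# pattern/replacement dictionary (kept as character vectors, not factors)
df <- data.frame(replace = c('crude', 'oil', 'price'),
                 with = c('xcrude', 'xoil', 'xprice'),
                 stringsAsFactors = FALSE)
# load tm
library(tm)
# load the crude corpus that ships with tm
data('crude')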

I tried this and received the following warning:

tm_map(crude, mapply, gsub, df$replace, df$with)

Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code
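
One possible reading of the warning, sketched in base R without tm: the message shows that FUN is called as FUN(document, ...), so mapply() would receive the document itself as its FUN argument. The list docs and the helper replace_terms() below are hypothetical stand-ins introduced only for this illustration; they are not part of tm.

```r
patterns     <- c("crude", "oil", "price")
replacements <- c("xcrude", "xoil", "xprice")

# Hypothetical stand-in for content(crude): a plain list of strings
docs <- list("crude oil price")

# lapply() calls FUN(doc, ...), so mapply() gets the document as its FUN
# argument and fails when it tries to look that up as a function
failed <- tryCatch(
  lapply(docs, mapply, gsub, patterns, replacements),
  error = function(e) e
)
inherits(failed, "error")

# Hypothetical helper: takes the text first, then applies each pair in turn
replace_terms <- function(text, patterns, replacements) {
  for (i in seq_along(patterns)) {
    text <- gsub(patterns[i], replacements[i], text, fixed = TRUE)
  }
  text
}

fixed_docs <- lapply(docs, replace_terms, patterns, replacements)
fixed_docs[[1]]
```

A function shaped like replace_terms(), with the text as its first argument, is the kind of function that can be passed on to a per-document mapper.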
spindoctor

1 Answer


Based on this answer, you could use stringi and wrap the call in content_transformer() to preserve the corpus structure:

library(stringi)

# vectorize_all = FALSE applies every pattern to every document
corp <- tm_map(crude, content_transformer(
  function(x) {
    stri_replace_all_fixed(x, df$replace, df$with, vectorize_all = FALSE)
  })
)
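
To see what vectorize_all = FALSE changes, here is a minimal sketch on a plain character vector rather than a corpus. The vector x is made up for illustration; it has the same length as the pattern vector so that the default pairing behaviour is visible without recycling.

```r
library(stringi)

patterns     <- c("crude", "oil", "price")
replacements <- c("xcrude", "xoil", "xprice")
x <- rep("crude oil price", 3)

# Default (vectorize_all = TRUE): element i of x is paired with pattern i only
paired <- stri_replace_all_fixed(x, patterns, replacements)

# vectorize_all = FALSE: every pattern is applied, in order, to every element
sequential <- stri_replace_all_fixed(x, patterns, replacements,
                                     vectorize_all = FALSE)

paired
sequential
```

With the default, each string has only one of the three terms replaced; with vectorize_all = FALSE, all three terms are replaced in every string, which is the dictionary-style behaviour the question asks for.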

Or use multigsub() from qdap:

library(qdap)

corp <- tm_map(crude, content_transformer(
  function(x) {
    # name text.var explicitly so the argument order is unambiguous
    multigsub(df$replace, df$with, text.var = x, fixed = FALSE)
  })
)

Which gives:

> corp[[1]][1]

"Diamond Shamrock Corp said that\neffective today it had cut its contract xprices for xcrude xoil by\n1.50 dlrs a barrel.\n The reduction brings its posted xprice for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n \"The xprice reduction today was made in the light of falling\nxoil product xprices and a weak xcrude xoil market,\" a company\nspokeswoman said.\n
Diamond is the latest in a line of U.S. xoil companies that\nhave cut its contract, or posted, xprices over the last two days\nciting weak xoil markets.\n Reuter"

You can then apply other tm functions on the resulting corpus:

> DocumentTermMatrix(corp)
#<<DocumentTermMatrix (documents: 20, terms: 1269)>>
#Non-/sparse entries: 2262/23118
#Sparsity           : 91%
#Maximal term length: 17
#Weighting          : term frequency (tf)
Steven Beaupré
And if the terms to be replaced are actually regular expressions, I would use stri_replace_all_regex instead. – spindoctor May 12 '16 at 15:43
So, this is only partially working. When you check the structure of what this returns with str(corp) and str(crude), it's a list, not a corpus, even though class(corp) and class(crude) report the same class. I think this is significant because basic functions like DocumentTermMatrix() no longer work: compare DocumentTermMatrix(corp) with DocumentTermMatrix(crude). – spindoctor May 12 '16 at 18:19
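
For the regular-expression case raised in the first comment, a minimal sketch on a plain string might look like the following. The word-boundary patterns and the sample text are assumptions chosen for illustration; they are not taken from the original dictionary.

```r
library(stringi)

# Assumed regex dictionary: \\b restricts each match to a whole word
patterns     <- c("\\bcrude\\b", "\\boil\\b", "\\bprice\\b")
replacements <- c("xcrude", "xoil", "xprice")

x <- "crude oil price, oily residue"

# Whole-word regex replacement leaves "oily" untouched
regex_result <- stri_replace_all_regex(x, patterns, replacements,
                                       vectorize_all = FALSE)

# Fixed-string replacement also rewrites the "oil" inside "oily"
fixed_result <- stri_replace_all_fixed(x, c("crude", "oil", "price"),
                                       replacements, vectorize_all = FALSE)

regex_result
fixed_result
```

The same function could be dropped into the content_transformer() wrapper from the answer in place of stri_replace_all_fixed.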