
I have been trying to work through the following tutorial: http://www.rdatamining.com/examples/text-mining. However, instead of using the Twitter data, I have been using a .csv file (unfortunately, the contents are sensitive and cannot be made public).

The .csv file has two columns: a user key in column A and a piece of narrative text (Response) in column B. The file is read with the following code:

# Read the CSV, keeping strings as characters rather than factors
Data <- read.csv(file="PATH/FILE.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
# Drop rows with an empty Response
Data <- Data[!(Data$Response==""), ]
# Convert the Response column into a list, one element per narrative
df <- do.call("rbind", lapply(Data$Response, as.list))

df is a 'list of 91' with each item in the list being of type "character".

The tutorial is followed from the line library(tm) onwards with no differences, except for the addition of NarrativeCorpus <- tm_map(NarrativeCorpus, PlainTextDocument) after myCorpus <- tm_map(myCorpus, removeWords, myStopwords), which I found was needed for stemming.
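For reference, this is roughly what the pipeline looks like up to that point (a sketch following the tutorial; in my actual script the corpus is called NarrativeCorpus, and the stopword list shown here is only illustrative):

library(tm)
library(SnowballC)  # provides the stemmer used by stemDocument

# Build the corpus from the narrative text (the tutorial builds it from df)
myCorpus <- Corpus(VectorSource(Data$Response))

# Cleaning steps as in the tutorial
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myStopwords <- stopwords("english")  # illustrative; my real list differs slightly
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

# The extra step I added, needed before stemming
myCorpus <- tm_map(myCorpus, PlainTextDocument)

# Keep an unstemmed copy as the dictionary for stem completion, then stem
dictCorpus <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)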

The code fails at the stem-completion step, myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus), with the error:

Error in grep(sprintf("^%s", w), dictionary, value = TRUE) : invalid regular expression, reason 'Out of memory'

I have searched online and on Stack Overflow with little luck.

I have also tried converting the reference dictionary into a list of unique words and then back into a corpus (to reduce its size), but to no avail.
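Concretely, that attempt looked something like the following (a sketch from memory; dictWords and dictCorpusSmall are illustrative names):

# Flatten the dictionary corpus into a vector of unique words...
dictText <- unlist(lapply(dictCorpus, as.character))
dictWords <- unique(unlist(strsplit(dictText, "\\s+")))
# ...then rebuild a much smaller corpus from them to use as the dictionary
dictCorpusSmall <- Corpus(VectorSource(dictWords))
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpusSmall)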

I am using 64-bit R 3.2.3 with RStudio Desktop 0.99.891 on a Windows 7 laptop with 4 GB RAM. All packages are up to date (according to RStudio).

This is my first SO post, so I welcome advice on what I should have included and why, etc.

Chas Nelson
  • Did you try adding `lazy=TRUE` to all `tm_map` calls? Also, try to add some replication code, eventually replacing your data with something else. – CptNemo Mar 01 '16 at 14:41
  • @CptNemo Thanks. That seems to have solved the memory issue (a sketch of the resulting calls appears after these comments)... But the stemCompletion is acting on the 'Metadata' and 'Content' tags of the corpus and seemingly leaving the actual content untouched. – Chas Nelson Mar 01 '16 at 15:47
  • So you still keep getting the error about the invalid regular expression? You really need to post some replication code and data... – CptNemo Mar 02 '16 at 10:39
  • @CptNemo, sorry I didn't explain myself very well. The memory issue (including the invalid regular expression error) was solved by your suggestion. The second part [of my comment] was a separate issue that I have also now resolved. For my future reference what do you mean by replication code? All the code used is either in or referenced in my initial question. The lack of data was mentioned in my question but I will make up example data for future questions, thanks. – Chas Nelson Mar 02 '16 at 11:37
  • Have a look here http://stackoverflow.com/help/mcve – CptNemo Mar 02 '16 at 11:59
  • Thanks @CptNemo, noted for future questions! – Chas Nelson Mar 02 '16 at 13:55
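For anyone who finds this later: the fix from the comments above amounted to passing lazy=TRUE to the tm_map calls, roughly as below (a sketch; the lazy argument was available in the tm version I was using in 2016 and may not exist in newer releases):

# Same transformations as before, evaluated lazily to avoid the out-of-memory error
myCorpus <- tm_map(myCorpus, removeWords, myStopwords, lazy=TRUE)
myCorpus <- tm_map(myCorpus, PlainTextDocument, lazy=TRUE)
dictCorpus <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument, lazy=TRUE)
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus, lazy=TRUE)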

1 Answer


I had a similar issue, Error in grep(sprintf("^%s", w), dictionary, value = TRUE) : invalid regular expression, and after searching on SO I found the solution in this thread, which in turn came from this website.

This code should be added after loading your corpus:

# Re-encode every document to UTF-8, substituting any invalid bytes
content_transformer <- function(x) iconv(x, to='UTF-8-MAC', sub='byte')
myCorpus <- tm_map(myCorpus, content_transformer)
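Two caveats (my additions, not from the linked post): the 'UTF-8-MAC' encoding is Mac-specific and may not be available on Windows, where plain 'UTF-8' is the usual choice; and the function name above shadows tm's own content_transformer(). With a recent version of tm, an equivalent form would be roughly:

# Re-encode the document content in place, substituting invalid bytes;
# plain 'UTF-8' is assumed here as a portable target encoding
toUTF8 <- content_transformer(function(x) iconv(x, to='UTF-8', sub='byte'))
myCorpus <- tm_map(myCorpus, toUTF8)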

Good luck

Habib Karbasian
  • Hello, please expand your answer to actually provide what the fix was that worked for you, instead of just a link to another post. – Chait Mar 08 '17 at 04:44