
I'm using the tm package to run LDA on a corpus of 10,000 documents.

rtcorpus.4star <- Corpus(DataframeSource(rt.subset.4star)) ##creates the corpus
rtcorpus.4star[[1]] ##accesses the first document

I'm trying to write a piece of code that will add the word "specialword" after certain words. So essentially: for a vector of words that I choose (good, nice, happy, fun, love), I want the code to loop through each document and add the word "specialword" after any occurrence of these words.

So for example, given this document:

I had a really fun time

I want the result to be this:

I had a really fun specialword time

The issue is that I'm not sure how to do this because I don't know how to get the code to read the text inside the corpus. I know I should probably use a for loop (or maybe not), but I'm not sure how to loop through each word in each document, and each document in the corpus. I'm also wondering if I can use something along the lines of a "translate" function that works with tm_map.


Edit:

Made some attempts. This code returns "test" as NA. Do you know why?

special <- c("poor", "lose")
for (i in special){
  test <- gsub(special[i], paste(special[i], "specialword"), rtcorpus.1star[[1]])
}

Edit: figured it out!! thanks

special <- c("poor", "lose")
for (i in 1:length(special)){
  rtcorpus.codewordtest <- gsub(special[i], paste(special[i], "specialword"), rtcorpus.codewordtest)
}
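
For the record, the same substitution can also be applied while the documents are still inside the tm corpus, using tm_map() and content_transformer(). A minimal sketch (assuming a standard tm VCorpus and that a single regex built from the special words is acceptable; object names are taken from my code above):

library(tm)

special <- c("poor", "lose")

## one regex that matches any special word as a whole word
pattern <- paste0("\\b(", paste(special, collapse = "|"), ")\\b")

## content_transformer() wraps a plain string function so tm_map() can apply it to each document
add_specialword <- content_transformer(function(x) {
  gsub(pattern, "\\1 specialword", x)
})

rtcorpus.codewordtest <- tm_map(rtcorpus.4star, add_specialword)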
user2303557
  • Are you referring to LDA, as in latent Dirichlet allocation? This is a 'bag-of-words' method, so it doesn't know or care about word order within a document. All the words in each document are treated as a jumble, and that jumble is the basic unit of analysis. Inserting a word like this will only make a difference if you're splitting the documents into chunks and generating the lda model with those chunks. – Ben Apr 06 '14 at 04:57

2 Answers


What if you tried something like this?

corpus <- read("filename.txt")
special <- c("fun","nice","love")
for (w in special) {
    gsub(w, w + " specialword", corpus)}

This would load the file, iterate through your list of words, and replace the word with the word itself followed by " specialword" (note the space).

Edit: I just saw you have multiple files. To loop through the files in the corpus, you can do this:

 corpus <- "filepath/desktop/wherever/folderwithcorpus/"
 special <- c("fun","nice","love")

 for (file in corpus){
      data <- read(file)
      for (w in special){
           gsub(w, w + " specialword", corpus)}
      }
wugology
  • Thanks for the suggestion. I tried that, and got this error: Error in w + " specialword" : non-numeric argument to binary operator – user2303557 Apr 05 '14 at 23:34
  • I think I'm mixing up my python and R syntax. There should be a way to concatenate them, but you may need regular expressions. – wugology Apr 05 '14 at 23:37
  • Hmm it is giving me that same error. Do you know if something like this "translate" function would work? I'm reading about it here: http://stackoverflow.com/questions/20580002/replace-words-in-corpus-according-to-dictionary-data-frame – user2303557 Apr 05 '14 at 23:41
  • I've never used translate but seems worth a try. – wugology Apr 05 '14 at 23:47
  • mind looking at my edit in the old post? it's returning NA. can't seem to figure out why. thanks! – user2303557 Apr 05 '14 at 23:54
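
For reference, a sketch of the idea above in working R syntax. It uses readLines() instead of read() (which is not a base R function), paste0() for the concatenation that + does in Python, and assigns the gsub() result back; the folder path is the hypothetical one from the answer:

special <- c("fun", "nice", "love")
files <- list.files("filepath/desktop/wherever/folderwithcorpus/", full.names = TRUE)

for (f in files) {
    text <- readLines(f)
    for (w in special) {
        ## insert " specialword" after each occurrence of w
        text <- gsub(w, paste0(w, " specialword"), text)
    }
    writeLines(text, f)  ## overwrites the original file; write elsewhere to keep the source intact
}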

Perhaps this is not a tm package capability, but you could use a simple paste() call to add "specialword" immediately after each of your chosen words before building the corpus. Or str_replace() from the stringr package would do this if your documents can be kept in a list (I think); see the sketch below.

Then create the corpus.
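
A rough sketch of that idea with stringr, applied to the raw text before the corpus is built (the column name text and the word vector are assumptions; adjust to your data frame):

library(stringr)

special <- c("good", "nice", "happy", "fun", "love")
docs <- rt.subset.4star$text  ## assuming the raw text sits in a column called "text"

## add "specialword" after every whole-word match, then build the corpus from the result
pattern <- paste0("\\b(", paste(special, collapse = "|"), ")\\b")
docs <- str_replace_all(docs, pattern, "\\1 specialword")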

lawyeR
  • My problem is that I have to first turn it into a corpus, because I need to stem the words. That way I don't have to replace "happy" or "happiness" -- it all is stemmed to "happi." – user2303557 Apr 05 '14 at 23:43