
I'm using the tm package to run LDA on a corpus of 10,000 documents.

rtcorpus.4star <- Corpus(DataframeSource(rt.subset.4star)) ##creates the corpus
rtcorpus.4star[[1]] ##accesses the first document

I'm trying to write a piece of code that will add the word "specialword" after certain words. So essentially: for a vector of words that I choose (good, nice, happy, fun, love), I want the code to loop through each document and add the word "specialword" after any occurrence of these words.

So for example, given this document:

I had a really fun time

I want the result to be this:

I had a really fun specialword time

The issue is that I'm not sure how to do this because I don't know how to get the code to read the text inside the corpus. I know I should probably use a for loop (or maybe not), but I'm not sure how to loop through each word in each document, and each document in the corpus. I'm also wondering if I can use something along the lines of a "translate" function that works with tm_map.


Edit:

Made some attempts. This code returns "test" as NA. Do you know why?

special <- c("poor", "lose")
for (i in special){
  test <- gsub(special[i], paste(special[i], "specialword"), rtcorpus.1star[[1]])
}

Edit: figured it out!! thanks

special <- c("poor", "lose")
for (i in 1:length(special)){
  rtcorpus.codewordtest <- gsub(special[i], paste(special[i], "specialword"), rtcorpus.codewordtest)
}
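
For the record, the same substitution can also be applied while the documents are still inside the tm corpus, using tm_map() and content_transformer(). A minimal sketch (assuming a standard tm VCorpus and that a single regex built from the special words is acceptable; object names are taken from my code above):

library(tm)

special <- c("poor", "lose")

## one regex that matches any special word as a whole word
pattern <- paste0("\\b(", paste(special, collapse = "|"), ")\\b")

## content_transformer() wraps a plain string function so tm_map() can apply it to each document
add_specialword <- content_transformer(function(x) {
  gsub(pattern, "\\1 specialword", x)
})

rtcorpus.codewordtest <- tm_map(rtcorpus.4star, add_specialword)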
user2303557
  • Are you referring to LDA, as in latent Dirichlet allocation? This is a 'bag-of-words' method, so it doesn't know or care about word order within a document. All the words in each document are treated as a jumble, and that jumble is the basic unit of analysis. Inserting a word like this will only make a difference if you're splitting the documents into chunks and generating the lda model with those chunks. – Ben Apr 06 '14 at 04:57

2 Answers


What if you tried something like this?

corpus <- read("filename.txt")
special <- c("fun","nice","love")
for (w in special) {
    gsub(w, w + " specialword", corpus)}

This would load the file, iterate through your list of words, and replace the word with the word itself followed by " specialword" (note the space).

Edit: I just saw you have multiple files. To loop through the files in the corpus, you can do this:

 corpus <- "filepath/desktop/wherever/folderwithcorpus/"
 special <- c("fun","nice","love")

 for (file in corpus){
      data <- read(file)
      for (w in special){
           gsub(w, w + " specialword", corpus)}
      }
wugology
  • Thanks for the suggestion. I tried that, and got this error: Error in w + " specialword" : non-numeric argument to binary operator – user2303557 Apr 05 '14 at 23:34
  • I think I'm mixing up my python and R syntax. There should be a way to concatenate them, but you may need regular expressions. – wugology Apr 05 '14 at 23:37
  • Hmm it is giving me that same error. Do you know if something like this "translate" function would work? I'm reading about it here: http://stackoverflow.com/questions/20580002/replace-words-in-corpus-according-to-dictionary-data-frame – user2303557 Apr 05 '14 at 23:41
  • I've never used translate but seems worth a try. – wugology Apr 05 '14 at 23:47
  • mind looking at my edit in the old post? it's returning NA. can't seem to figure out why. thanks! – user2303557 Apr 05 '14 at 23:54
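
For reference, a sketch of the idea above in working R syntax. It uses readLines() instead of read() (which is not a base R function), paste0() for the concatenation that + does in Python, and assigns the gsub() result back; the folder path is the hypothetical one from the answer:

special <- c("fun", "nice", "love")
files <- list.files("filepath/desktop/wherever/folderwithcorpus/", full.names = TRUE)

for (f in files) {
    text <- readLines(f)
    for (w in special) {
        ## insert " specialword" after each occurrence of w
        text <- gsub(w, paste0(w, " specialword"), text)
    }
    writeLines(text, f)  ## overwrites the original file; write elsewhere to keep the source intact
}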

Perhaps this is not a tm package capability, but you could use a simple paste() call to add "specialword" immediately after each of your chosen words before building the corpus. Or str_replace() from the stringr package would do this if your documents can be kept in a list (I think); see the sketch below.

Then create the corpus.
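
A rough sketch of that idea with stringr, applied to the raw text before the corpus is built (the column name text and the word vector are assumptions; adjust to your data frame):

library(stringr)

special <- c("good", "nice", "happy", "fun", "love")
docs <- rt.subset.4star$text  ## assuming the raw text sits in a column called "text"

## add "specialword" after every whole-word match, then build the corpus from the result
pattern <- paste0("\\b(", paste(special, collapse = "|"), ")\\b")
docs <- str_replace_all(docs, pattern, "\\1 specialword")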

lawyeR
  • My problem is that I have to first turn it into a corpus, because I need to stem the words. That way I don't have to replace "happy" or "happiness" -- it all is stemmed to "happi." – user2303557 Apr 05 '14 at 23:43