
I'm trying to do some topic modelling, but I want to use phrases where they exist rather than single words, i.e.

library(topicmodels)
library(tm)
my.docs = c('the sky is blue, hot sun', 'flowers,hot sun', 'black cats, bees, rats and mice')
my.corpus = Corpus(VectorSource(my.docs))
my.dtm = DocumentTermMatrix(my.corpus)
inspect(my.dtm)

When I inspect my dtm it splits all the words up, but I want to keep the phrases together, i.e. there should be a column for each of: "the sky is blue", "hot sun", "flowers", "black cats", "bees", "rats and mice".

How do I make the Document Term Matrix recognise both phrases and words? They are comma separated.

The solution needs to be efficient, as I want to run it over a lot of data.

shecode

2 Answers


You might try an approach using a custom tokenizer. You define the multiple-word terms you want treated as phrases (I am not aware of an algorithmic way to do that step):

tokenizing.phrases <- c("sky is blue", "hot sun", "black cats")

Note that no stemming is done, so if you want both "black cats" and "black cat", then you will need to enter both variations. Case is ignored.

Then you need to create a function:

    phraseTokenizer <- function(x) {
      require(stringr)  # for str_trim(), str_detect(), str_split()

      x <- as.character(x)  # extract the plain text from the tm TextDocument object
      x <- str_trim(x)
      if (length(x) == 0 || is.na(x)) return("")

      # case-insensitive, literal matching (stringr's old ignore.case() is defunct;
      # fixed(..., ignore_case = TRUE) is the current equivalent)
      phrase.hits <- str_detect(x, fixed(tokenizing.phrases, ignore_case = TRUE))

      if (any(phrase.hits)) {
        # split only once, on the first phrase found; the recursive calls below
        # take care of any further occurrences of the same phrase
        split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
        temp <- unlist(str_split(x, fixed(split.phrase, ignore_case = TRUE), 2))
        out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
      } else {
        out <- MC_tokenizer(x)  # fall back to tm's standard word tokenizer
      }

      out[out != ""]
    }

Then you proceed as normal to create a term-document matrix, but this time you include the tokenized phrases by means of the control argument:

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer)) 
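From there the matrix can go straight into topicmodels. A minimal end-to-end sketch on the question's toy corpus (using tm's default tokenizer for brevity; `k = 2` and the seed are arbitrary choices for this tiny example):

```r
library(tm)
library(topicmodels)

my.docs <- c("the sky is blue, hot sun", "flowers,hot sun",
             "black cats, bees, rats and mice")
my.corpus <- VCorpus(VectorSource(my.docs))

# LDA() wants documents as rows, so build a DocumentTermMatrix
# (or call t() on a TermDocumentMatrix); to keep phrases intact,
# add control = list(tokenize = phraseTokenizer) as above
my.dtm <- DocumentTermMatrix(my.corpus)

my.lda <- LDA(my.dtm, k = 2, control = list(seed = 1234))
terms(my.lda, 3)  # top 3 terms per topic
```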
lawyeR
  • I can see this solution being really useful when I want to define phrases from some dirty word data, but my phrases have already been defined. Will this skip any single words or phrases that are not defined? I basically have a long vector of phrases/words that don't need to be cleaned. Is there a simpler solution where I can assume everything within each comma-separated field is a word or phrase that I want to include? – shecode Feb 02 '15 at 19:15
  • This answer leaves all the other words/terms as is. No change. What it does is allow tm to treat the defined terms as units (tokens). As to the simple solution, sure, just treat your comma-separated multi-word phrases as the tokens. – lawyeR Feb 02 '15 at 20:32
  • Great. I am running it now. It is very slow; that is the other thing. – shecode Feb 02 '15 at 21:44
  • If this is useful or answers your question (I can't help on the speed of your computer or the size of your token list) consider accepting the answer. Thanks. – lawyeR Feb 02 '15 at 22:06
  • My fault, I should have stated that I need to run it over a sizable dataset: about 500 tokens and 15,000 phrases. – shecode Feb 02 '15 at 22:10
  • lawyeR, your solution is good but it could not perform. I managed to solve it by adding a dash between the words in the phrases and using the regular dtm function. – shecode Feb 02 '15 at 22:30
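Since the fields are already comma-separated, the simpler route mentioned in the comments (treating everything between commas as one token) avoids the recursion entirely and should scale much better. A sketch of that idea; the `commaTokenizer` name is mine, and `VCorpus` is used because recent tm versions ignore custom tokenizers on the default `SimpleCorpus`:

```r
library(tm)

my.docs <- c("the sky is blue, hot sun", "flowers,hot sun",
             "black cats, bees, rats and mice")

# one token per comma-separated field, whether it is a word or a phrase
commaTokenizer <- function(x) {
  tokens <- trimws(unlist(strsplit(as.character(x), ",", fixed = TRUE)))
  tokens[tokens != ""]
}

my.corpus <- VCorpus(VectorSource(my.docs))
my.dtm <- DocumentTermMatrix(my.corpus,
                             control = list(tokenize = commaTokenizer))
Terms(my.dtm)  # multi-word terms such as "hot sun" survive as single columns
```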

Maybe have a look at this relatively recent publication on that topic:

http://web.engr.illinois.edu/~hanj/pdf/kdd13_cwang.pdf

They give an algorithm for identifying phrases and for partitioning/tokenizing a document into those phrases.

Whadupapp