
I have a project requiring me to search the annual reports of various companies and find key phrases in them. I have converted the reports to text files, then created and cleaned a corpus. I then created a document-term matrix. The tm_term_score function only seems to work for single words, not phrases. Is it possible to search the corpus for key phrases (not necessarily the most frequent)?

For example:

I want to see how many times the phrase "supply chain finance" appears in each document in the corpus. However, when I run the code using tm_term_score, it returns that no documents contained the phrase, when in fact they did.

My progress so far looks as follows:

library(tm)
library(stringr)

setwd("C:/Users/Desktop/Annual Reports")

dest <- "C:/Users/Desktop/Annual Reports"

a <- Corpus(DirSource("C:/Users/Desktop/Annual Reports"), readerControl = list(language = "lat"))

a <- tm_map(a, removeNumbers)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)

tokenizing.phrases <- c("supply growth", "import revenues", "financing projects")

I am very new to R and cannot decipher how to search my corpus for these key phrases.

1 Answer


Perhaps something like the following will help you.

First, create an object with your key phrases, such as

tokenizing.phrases <- c("general counsel", "chief legal officer", "inside counsel", "in-house counsel",
                        "law department", "law dept", "legal department", "legal function",
                        "law firm", "law firms", "external counsel", "outside counsel",
                        "law suit", "law suits", # can be hyphenated, eg.
                        "accounts payable", "matter management")

Then use this function (perhaps with tweaks for your needs).

phraseTokenizer <- function(x) {
  require(stringr)

  x <- as.character(x)           # extract the plain text from the tm TextDocument object
  x <- paste(x, collapse = " ")  # collapse multi-line documents into a single string
  x <- str_trim(x)
  if (is.na(x) || x == "") return(character(0))
  # warning(paste("doing:", x))
  # fixed(..., ignore_case = TRUE) replaces the deprecated stringr::ignore.case()
  phrase.hits <- str_detect(x, fixed(tokenizing.phrases, ignore_case = TRUE))

  if (any(phrase.hits)) {
    # only split once, on the first hit, so as not to worry about multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    # warning(paste("split phrase:", split.phrase))
    temp <- unlist(str_split(x, fixed(split.phrase, ignore_case = TRUE), 2))
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2])) # recursive: the function calls itself on each half
  } else {
    out <- MC_tokenizer(x) # tm's plain word tokenizer for text containing no key phrase
  }

  # drop any extraneous empty strings, which can appear when a phrase abuts punctuation
  out[out != ""]
}
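
As a quick sanity check, you can run the tokenizer on a made-up sentence (the text below is purely illustrative) and confirm that the key phrases come back as single tokens:

phraseTokenizer("The general counsel hired a law firm to review accounts payable.")
# should return the phrases as single tokens, roughly:
# "The" "general counsel" "hired" "a" "law firm" "to" "review" "accounts payable"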

Then create your term document matrix with the phrases included in it.

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))
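
Once the matrix is built, you can pull out just the rows for your phrases to get per-document counts. A minimal sketch, assuming at least some of the phrases actually occur in the corpus (so rows exist for them):

phrase.terms <- intersect(tokenizing.phrases, Terms(tdm)) # keep only phrases that made it into the matrix
inspect(tdm[phrase.terms, ])                              # counts of each phrase in each document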
– lawyeR
  • Thank you for your answer, lawyeR. I am still struggling to work through it. I have edited my question to include the parts of your suggestion that I understand. Sorry about this! I'm very new to R and I appreciate your help. – Warwick Maddock Jul 17 '15 at 03:31
  • Hi lawyeR! When I enter the code you provided I receive the following error and warning messages: Error in str_detect(x, ignore_case=TRUE(tokenising.phrases)): unused argument (ignore_case=TRUE(tokenising.phrases)). In addition: Warning message: In if (is.na(a)) return(""): the condition has length > 1 and only the first element will be used. How can I resolve this problem? I appreciate your help! – Warwick Maddock Jul 20 '15 at 04:47