
I'm trying to mine a set of PDFs for specific two- and three-word phrases. I know this question has been asked before under various circumstances.

This solution partly works. However, the list does not return strings containing more than one word.

I've tried the solutions offered in related threads (as well as many others). Unfortunately, nothing works.

Also, the qdap library won't load, and I wasted an hour trying to solve that problem, so that solution won't work either, even though it seems reasonably easy.

library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")

dtm <- DocumentTermMatrix(crude, control=list(dictionary = my_words))

# create data.frame from documenttermmatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL)
head(df1)

As you can see, the output returns "contract.prices" instead of "contract prices" so I'm looking for a simple solution to this. File 127 includes the phrase 'contract prices' so the table should record at least one instance of this.

I'm also happy to share my actual data, but I'm not sure how to save a small portion of it (it's gigantic). So for now I'm using a substitute with the 'crude' data.

  • I am not sure what your bigger goal is, but if you add `check.names = FALSE` to your `data.frame` call you'll get `"oil corporation"`: `data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL, check.names = FALSE)` – Ronak Shah Sep 28 '19 at 03:28
  • @RonakShah Thanks for your reply. The bigger goal is to search my corpus for specific phrases. This doesn't seem to solve the problem: while it displays "oil corporation" instead of "oil.corporation", it still doesn't count any of that phrase. For example, looking at text 127, the phrases "contract prices" and "diamond shamrock" occur at least once. If I replace the above `my_words` container with `my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")` and run the rest of the code as is, it still does not count those two-word phrases. – socialresearcher Sep 28 '19 at 04:23

2 Answers


Here is a way to get what you want using the tm package together with RWeka. You need to create a separate tokenizer function that you plug into the DocumentTermMatrix function. RWeka plays very nicely with tm for this.

If you don't want to install RWeka because of its Java dependency, you can use another package such as tidytext or quanteda. If you need speed because of the size of your data, I'd advise the quanteda package (example below the tm code). quanteda runs in parallel, and with quanteda_options you can specify how many cores you want to use (2 cores is the default).
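For example, the thread count can be set before building the dfm (the value 4 below is just an illustration, not a recommendation):

```r
library(quanteda)

# raise quanteda's worker count from the default of 2 to 4 threads
quanteda_options(threads = 4)

# check the current setting
quanteda_options("threads")
```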

Note that the unigrams and bigrams in your dictionary overlap. In this example you will see that in text 127 the word "prices" is double counted: it contributes to both "prices" (3) and "contract prices" (1).

library(tm)
library(RWeka)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")


# adjust to min = 2 and max = 3 for 2 and 3 word ngrams
RWeka_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 2)) 
}

dtm <- DocumentTermMatrix(crude, control=list(tokenize = RWeka_tokenizer,
                                              dictionary = my_words))

# create data.frame from documenttermmatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL, check.names = FALSE)
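If installing RWeka is as troublesome as qdap was, the same idea works with a plain-R tokenizer in place of NGramTokenizer. This is a sketch of my own, not part of the answer above, and `unigram_bigram_tokenizer` is a name I made up; note that, unlike RWeka, this naive `strsplit` approach leaves punctuation attached to words, so you may want `removePunctuation` in your preprocessing for the counts to match:

```r
library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices",
              "diamond", "shamrock", "diamond shamrock")

# emit every unigram plus every adjacent-word bigram, no Java needed
unigram_bigram_tokenizer <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  words <- words[words != ""]
  bigrams <- if (length(words) > 1)
    paste(head(words, -1), tail(words, -1)) else character(0)
  c(words, bigrams)
}

dtm <- DocumentTermMatrix(crude,
  control = list(tokenize = unigram_bigram_tokenizer,
                 dictionary = my_words))

df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm),
                  row.names = NULL, check.names = FALSE)
```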

For speed if you have a big corpus quanteda might be better:

library(quanteda)

corp_crude <- corpus(crude)
# adjust ngrams to 2:3 for 2 and 3 word ngrams
toks_crude <- tokens(corp_crude, ngrams = 1:2, concatenator = " ")
toks_crude <- tokens_keep(toks_crude, pattern = dictionary(list(words = my_words)), valuetype = "fixed")
dfm_crude <- dfm(toks_crude)
df1 <- convert(dfm_crude, to = "data.frame")
– phiver

You can work with sequences of tokens in quanteda if you first wrap your multi-word patterns in the phrase() function.

library("quanteda")
#> Package version: 1.5.1

data("crude", package = "tm")
data_corpus_crude <- corpus(crude)

my_words <- c("diamond", "contract prices", "diamond shamrock")

You could extract these using kwic() for instance.

kwic(data_corpus_crude, pattern = phrase(my_words))
#>                                                               
#>    [127, 1:1]                             |     Diamond      |
#>    [127, 1:2]                             | Diamond Shamrock |
#>  [127, 12:13]        today it had cut its | contract prices  |
#>  [127, 71:71] a company spokeswoman said. |     Diamond      |
#>                                   
#>  Shamrock Corp said that effective
#>  Corp said that effective today   
#>  for crude oil by 1.50            
#>  is the latest in a

Or, to make them permanently into "compounded" tokens, use tokens_compound() (shown here in a simple example).

tokens("The diamond mining company is called Diamond Shamrock.") %>%
    tokens_compound(pattern = phrase(my_words))
#> tokens from 1 document.
#> text1 :
#> [1] "The"              "diamond"          "mining"          
#> [4] "company"          "is"               "called"          
#> [7] "Diamond_Shamrock" "."
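Once the phrases are compounded, each one counts as a single token, so the usual dfm()/convert() route gives per-document counts. A small sketch (the input sentence and `concatenator = " "` are my own additions, chosen so the feature names match the dictionary entries; `tokens_compound()` is case-insensitive by default):

```r
library(quanteda)

my_words <- c("diamond", "contract prices", "diamond shamrock")

toks <- tokens("Diamond Shamrock said it cut its contract prices.") %>%
    tokens_compound(pattern = phrase(my_words), concatenator = " ")

# dfm() lowercases features, so the columns line up with my_words
df2 <- convert(dfm(toks), to = "data.frame")
```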
– Ken Benoit