
This question is about the textstat_collocations() function in the quanteda package in R. I am getting phrases of more than two words in the output even though I request only two-word phrases.

The processing steps are as follows (corpus1 has already been created with the corpus() function):

# Identify two-word collocations in the corpus
collocations_two_words <- textstat_collocations(corpus1, method = "lambda", size = 2, min_count = 5, smoothing = 0.5, tolower = TRUE)

# Keep only collocations that occur at least 10 times
collocations_two_words <- collocations_two_words[collocations_two_words$count >= 10, ]

# Tokenize the corpus at the word level
tokens1 <- tokens(tolower(corpus1), what = "word", remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_separators = TRUE, remove_url = TRUE, remove_hyphens = TRUE)

# Remove English stopwords, padding with empty strings where they were removed
tokens1 <- tokens_remove(tokens1, stopwords("english"), padding = TRUE)

# Replace the individual tokens of each two-word collocation with the compound
tokens2 <- tokens_compound(tokens1, pattern = collocations_two_words)

# Build and trim the document-feature matrix
quantdfm <- dfm(tokens2, remove_punct = TRUE, remove_numbers = TRUE)
quantdfm <- dfm_trim(quantdfm, min_count = 5, min_docfreq = 5, verbose = TRUE)

When I inspect the quantdfm object (using tail(quantdfm)), I see features that are more than two words long. Can someone guide me on where I might be going wrong?

Sample output looks like this:

docs        choosing_dark_chocolate_can  eat_dark_chocolate
text43979   0                            0
text43980   0                            0
text43981   0                            0
text43982   0                            0
text43983   0                            0
text43984   0                            0
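
A quick way to list the offending compound features (a minimal check of my own, assuming the quantdfm object built by the steps above):

# List dfm features whose compounds contain more than two words,
# i.e. more than two parts when split on the "_" concatenator
long_feats <- featnames(quantdfm)[lengths(strsplit(featnames(quantdfm), "_", fixed = TRUE)) > 2]
head(long_feats)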

Output of dput(head(corpus1,5)):
structure(list(documents = structure(list(texts = c("..., video game consoles, stereos, smartphone chargers, and other similar devices constantly draw power into their power supplies. Unplug all of your chargers, whether it's for a tablet or a toothbrush. Electronics with standby or \"\"sleep\"\" modes: Desktop PCs, televisions, cable boxes, DVD-ray players, alarm clocks, radios, and anything with a remote", 
"...its judgment and order dated 02.05.2016 in Modern Dental College Research Centre (supra) authorizing it to oversee all statutory functions under the Act and leaving it at liberty to issue appropriate remedial directions, the impugned order is in the teeth of the recommendations of the said Committee, as communicated in its letter dated 14.05.2017", 
"...' focus to the ayurveda sector, especially in oral care. A year ago, Colgate launched its first India-focused ayurvedic brand, Cibaca Vedshakti, aimed squarely at countering Dant Kanti. HUL too launched araft of ayurvedic personal care products, including toothpaste, under the Ayush brand. RIVAL TO WATCH OUT FOR Colgate Palmolive global CEO Ian", 
"...founder of Increate Value Advisors. Patanjali has brought the focus back on product efficacy. Rising above the noise of advertising, products have to first deliver value to the consumers. Ghee and tooth paste are the two most popular products of Patanjali  even though both of these have enough local and multinational competitors in the organised", 
"The Bombay High Court today came down heavily on the Maharashtra government for not providing space and or hiring enough employees for the State Human Rights Commission. The commission has been left a toothless tiger as due to a lack of space and employees, it has not been able to hear cases of human rights violations in Maharashtra. A division"
)), .Names = "texts", row.names = c("text1", "text2", "text3", 
"text4", "text5"), class = "data.frame"), metadata = structure(list(
    source = "D:/Users/ajoshi/Documents/* on x86-64 by ajoshi", 
    created = "Fri Jan 26 19:42:21 2018"), .Names = c("source", 
"created")), settings = structure(list(stopwords = NULL, collocations = NULL, 
    dictionary = NULL, valuetype = "glob", stem = FALSE, delimiter_word = " ", 
    delimiter_sentence = ".!?", delimiter_paragraph = "\n\n", 
    clean_tolower = TRUE, clean_remove_digits = TRUE, clean_remove_punct = TRUE, 
    units = "documents"), .Names = c("stopwords", "collocations", 
"dictionary", "valuetype", "stem", "delimiter_word", "delimiter_sentence", 
"delimiter_paragraph", "clean_tolower", "clean_remove_digits", 
"clean_remove_punct", "units"), class = c("settings", "list")), 
    tokens = NULL), .Names = c("documents", "metadata", "settings", 
"tokens"), class = c("corpus", "list"))

Output of sessionInfo():

R version 3.4.3
other attached packages:
[1] servr_0.8           LDAvis_0.3.2        text2vec_0.5.1      stringr_1.2.0       data.table_1.10.4-3
[6] quanteda_0.99.22   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15         compiler_3.4.3       pillar_1.1.0         futile.logger_1.4.3  plyr_1.8.4          
 [6] futile.options_1.0.0 iterators_1.0.9      tools_3.4.3          digest_0.6.14        lubridate_1.7.1     
[11] tibble_1.4.1         gtable_0.2.0         lattice_0.20-35      rlang_0.1.6          Matrix_1.2-12       
[16] foreach_1.4.4        fastmatch_1.1-0      mlapi_0.1.0          grid_3.4.3           R6_2.2.2            
[21] RJSONIO_1.3-0        ggplot2_2.2.1        lambda.r_1.2         spacyr_0.9.3         magrittr_1.5        
[26] scales_0.5.0         codetools_0.2-15     mime_0.5             colorspace_1.3-2     httpuv_1.3.5        
[31] stringi_1.1.6        proxy_0.4-21         RcppParallel_4.3.20  lazyeval_0.2.1       munsell_0.4.3 
ds_newbie
  • Can you reproduce the problem and demonstrate? – Ken Benoit Jan 29 '18 at 08:00
  • Hello Ken Benoit, I have added the first few lines of the corpus along with the sessionInfo output. – ds_newbie Jan 29 '18 at 09:10
  • The objective is to run an LDA after some text preprocessing, using phrases to improve the interpretability of the topics. LDAvis is additionally used for visualization of the topics. – ds_newbie Jan 29 '18 at 09:31

1 Answer


This is the result on my system with quanteda v1.0.0:

require(quanteda)
txt <- c("..., video game consoles, stereos, smartphone chargers, and other similar devices constantly draw power into their power supplies. Unplug all of your chargers, whether it's for a tablet or a toothbrush. Electronics with standby or \"\"sleep\"\" modes: Desktop PCs, televisions, cable boxes, DVD-ray players, alarm clocks, radios, and anything with a remote", 
         "...its judgment and order dated 02.05.2016 in Modern Dental College Research Centre (supra) authorizing it to oversee all statutory functions under the Act and leaving it at liberty to issue appropriate remedial directions, the impugned order is in the teeth of the recommendations of the said Committee, as communicated in its letter dated 14.05.2017", 
         "...' focus to the ayurveda sector, especially in oral care. A year ago, Colgate launched its first India-focused ayurvedic brand, Cibaca Vedshakti, aimed squarely at countering Dant Kanti. HUL too launched araft of ayurvedic personal care products, including toothpaste, under the Ayush brand. RIVAL TO WATCH OUT FOR Colgate Palmolive global CEO Ian", 
         "...founder of Increate Value Advisors. Patanjali has brought the focus back on product efficacy. Rising above the noise of advertising, products have to first deliver value to the consumers. Ghee and tooth paste are the two most popular products of Patanjali  even though both of these have enough local and multinational competitors in the organised", 
         "The Bombay High Court today came down heavily on the Maharashtra government for not providing space and or hiring enough employees for the State Human Rights Commission. The commission has been left a toothless tiger as due to a lack of space and employees, it has not been able to hear cases of human rights violations in Maharashtra. A division")
corp <- corpus(txt)
col <- textstat_collocations(corp, method = "lambda", size = 2, min_count = 1, smoothing = 0.5, tolower = TRUE)

head(col)

        collocation count count_nested length   lambda        z
1      human rights     2            0      2 7.742836 3.689434
2  colgate launched     1            0      2 5.030438 3.553188
3 rights commission     1            0      2 5.030438 3.553188
4   ayurvedic brand     1            0      2 5.030438 3.553188
5  enough employees     1            0      2 5.030438 3.553188
6      launched its     1            0      2 5.030438 3.553188

table(col$length)

  2 
226

All the collocations have two elements. I guess that you are seeing larger collocations because your texts are not tokenized properly.
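
If that is the case, one way to keep the two steps consistent (a sketch on my side, not tested on your data, reusing the corp object above) is to compute the collocations on the same tokens object that you later pass to tokens_compound():

# Tokenize once, then use the same tokens for both collocation detection
# and compounding, so both steps see identical tokenization
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, stopwords("english"), padding = TRUE)
col2 <- textstat_collocations(toks, method = "lambda", size = 2, min_count = 1)
toks_comp <- tokens_compound(toks, pattern = col2)
head(dfm(toks_comp))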

Kohei Watanabe
  • Thanks Kohei. I am using word-level tokenization from the tokens function of quanteda v0.99. The collocations data frame seen at my end has only two-word phrases (I actually export this data frame). Next, I use tokens_compound to replace the individual tokens of the two-word phrases with the actual phrases. It is when I inspect the last few rows of the quantdfm object that I realise phrases of more than two words have also been captured. Should I be using v1.0 of quanteda now, or am I missing something? – ds_newbie Jan 29 '18 at 15:39
  • You should definitely use v1.0. There are a lot of fixes and improvements. If you want me to investigate further, please upload your corpus to Dropbox or something and share it with me. – Kohei Watanabe Jan 30 '18 at 06:01
  • I guess the join = TRUE (default) argument of tokens_compound results in this sort of behaviour (see the sketch after these comments). I am now using quanteda v1.0. Also, one suggestion would be the ability to coerce a collocations object back to collocations after converting it to data.table, e.g. collocations --> data.table (for additional manipulations) --> collocations (trimmed). Some of the two-word phrases were combinations of stopwords("english"), and to remove such combinations one has to use data.frame subsetting syntax instead of data.table syntax. Thanks to Kohei and Ken for their inputs. – ds_newbie Jan 30 '18 at 13:31
  • We recommend using `tokens_remove(x, stopwords(), padding = TRUE)` before `textstat_collocations()` if you do not want collocations with function words. – Kohei Watanabe Mar 04 '18 at 15:08
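
To illustrate the join behaviour mentioned in the comments above (a small sketch of my own, not from the thread, using a made-up sentence and pattern pair):

# With join = TRUE (the default), overlapping matches such as "dark chocolate"
# and "chocolate can" can be chained into a single longer compound; with
# join = FALSE the matched pairs are compounded separately
toks_toy <- tokens("choosing dark chocolate can help")
pair_pats <- phrase(c("dark chocolate", "chocolate can"))
tokens_compound(toks_toy, pattern = pair_pats, join = TRUE)
tokens_compound(toks_toy, pattern = pair_pats, join = FALSE)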