
I am using the quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features, so I decided to use the textstat_collocations() function. According to the documentation, and I quote:

"The tokens object . . . . While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized."

This makes perfect sense, so here goes:

library(dplyr)
library(tibble)
library(quanteda)
library(quanteda.textstats)

# Some sample data and lemmas
df <- c("this column has a lot of missing data, 50% almost!",
        "I am interested in missing data problems",
        "missing data is a headache",
        "how do you handle missing data?")

lemmas <- data.frame(inflected_form = c("missing", "data"),
                     lemma = c("miss", "datum"))

(1) Generate collocations using the corpus object:

txtCorpus = corpus(df)
docvars(txtCorpus)$text <- as.character(txtCorpus)
myPhrases = textstat_collocations(txtCorpus, tolower = FALSE)
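For reference, textstat_collocations() returns a data frame with one row per candidate collocation, scored by the lambda and z statistics; the inspection below is illustrative and not part of my original code:

# Inspect the highest-scoring candidates; columns include collocation,
# count, length, lambda, and z
head(myPhrases[order(-myPhrases$z), ])
# On real data one would usually keep only strong collocations before
# compounding, e.g. subset(myPhrases, z > 3) -- the threshold is an
# illustrative choice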

(2) Preprocess the text, compound the collocations, and lemmatize for downstream tasks:

# I used a blank space as the concatenator and the phrase() function as
# explained in the documentation, following the multi-word to multi-word
# substitution example at https://quanteda.io/reference/tokens_replace.html
txtTokens = tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE, 
                               remove_symbols = TRUE, remove_separators = TRUE) %>%
    tokens_tolower() %>%
    tokens_compound(pattern = phrase(myPhrases$collocation), concatenator = " ") %>%
    tokens_replace(pattern=phrase(c(lemmas$inflected_form)), replacement=phrase(c(lemmas$lemma)))
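To see what has happened at this point (an illustrative check, not in my original code): the collocation is now a single token containing a space, so the single-word patterns "missing" and "data" no longer match anything.

# "missing data" is now one token with an internal space
as.list(txtTokens)[1]
## $text1
## [1] "this"         "column"       "has"          "a"            "lot"
## [6] "of"           "missing data" "almost"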

(3) Test the results:

# Create dtm
dtm = dfm(txtTokens, remove_padding = TRUE)

# pull features
dfm_feat = as.data.frame(featfreq(dtm)) %>%
    rownames_to_column(var="feature") %>%
    `colnames<-`(c("feature", "count"))

dfm_feat
feature count
this 1
column 1
has 1
a 2
lot 1
of 1
almost 1
i 2
am 1
interested 1
in 1
problems 1
is 1
headache 1
how 1
do 1
you 1
handle 1
missing data 4

"missing data" should be "miss datum".

This only works if each document in df is a single word. I can make the process work if I generate my collocations from a tokens object from the get-go, but that's not what I want.

Cola4ever

1 Answer


The problem is that you have already compounded the elements of the collocations into a single "token" containing a space, but by supplying the phrase() wrapper in tokens_replace(), you are telling tokens_replace() to look for two sequential tokens, not the single token with an internal space.

The way to get what you want is by making the lemmatised replacement match the collocation.

phrase_lemmas <- data.frame(
  inflected_form = "missing data",
  lemma = "miss datum"
)
tokens_replace(txtTokens, phrase_lemmas$inflected_form, phrase_lemmas$lemma)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this"       "column"     "has"        "a"          "lot"       
## [6] "of"         "miss datum" "almost"    
## 
## text2 :
## [1] "i"          "am"         "interested" "in"         "miss datum"
## [6] "problems"  
## 
## text3 :
## [1] "miss datum" "is"         "a"          "headache"  
## 
## text4 :
## [1] "how"        "do"         "you"        "handle"     "miss datum"

An alternative would be to use tokens_lookup() directly on the uncompounded tokens, if you have a fixed listing of sequences you want to map to lemmatised sequences. E.g.,

tokens(txtCorpus) %>%
  tokens_lookup(dictionary(list("miss datum" = "missing data")),
    exclusive = FALSE, capkeys = FALSE
  )
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
##  [1] "this"       "column"     "has"        "a"          "lot"       
##  [6] "of"         "miss datum" ","          "50"         "%"         
## [11] "almost"     "!"         
## 
## text2 :
## [1] "I"          "am"         "interested" "in"         "miss datum"
## [6] "problems"  
## 
## text3 :
## [1] "miss datum" "is"         "a"          "headache"  
## 
## text4 :
## [1] "how"        "do"         "you"        "handle"     "miss datum"
## [6] "?"
Ken Benoit
  • Hi Ken, and thank you for your answer. The challenge with this approach is that I have a large df of lemmas (single words) and their replacements, just as in the example. The collocations can be any combination of them. Are you suggesting I should construct a list of inflected_forms and lemmas specifically for the collocations using pattern matching? – Cola4ever Sep 04 '21 at 10:30
  • Then you could replace them prior to compounding and then compound the lemmas. – Ken Benoit Sep 04 '21 at 14:02
  • Thanks again, Ken. That's what I ended up doing, though it wasn't so straightforward for me. I separated the collocations into two columns (I only need 2), tokenized, and ran tokens_replace() on each column, then pasted the words back together. This was pretty quick. String-based functions would be another option, but I get weird results. – Cola4ever Sep 05 '21 at 12:30
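For reference, the order of operations described in these comments can be sketched as follows: lemmatise while every token is still a single word, then compound using collocation patterns that have themselves been run through the lemma table (via the hypothetical lemmatize_phrase() helper above). This illustrates the approach; it is not code from the thread.

txtTokens2 <- tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE,
                     remove_symbols = TRUE, remove_separators = TRUE) %>%
  tokens_tolower() %>%
  # replace inflected forms while every token is still a single word
  tokens_replace(pattern = lemmas$inflected_form,
                 replacement = lemmas$lemma) %>%
  # then compound, matching the lemmatised form of each collocation
  tokens_compound(pattern = phrase(lemmatize_phrase(myPhrases$collocation, lemmas)),
                  concatenator = " ")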