When you ask for a kwic(), you get every pattern match, even when the matches overlap. The way to avoid the overlap, in the sense I think you are asking about, is to convert the multi-word expressions (MWEs) into single tokens so that they cannot overlap. In your case you want to count "Canadian charter" only when it is not followed by "of rights". I would suggest tokenizing the text and then compounding the MWEs in a sequence that guarantees they will not overlap.
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- "The Canadian charter of rights and the Canadian charter are different."
dict <- dictionary(list(constitution = c("charter of rights", "canadian charter")))
toks <- tokens(txt)
tokscomp <- toks %>%
  tokens_compound(phrase("charter of rights"), concatenator = " ") %>%
  tokens_compound(phrase("Canadian charter"), concatenator = " ")
tokscomp
## tokens from 1 document.
## text1 :
## [1] "The" "Canadian" "charter of rights"
## [4] "and" "the" "Canadian charter"
## [7] "are" "different" "."
This has made the phrases into single tokens, delimited here by a space, which means that kwic() (if that is what you want to use) will not double count them, since each is now a unique MWE match.
kwic(tokscomp, dict, window = 2)
##
## [text1, 3] The Canadian | charter of rights | and the
## [text1, 6] and the | Canadian charter | are different
Note that if you simply want to count them, you could have used dfm() with your dictionary as the value of the select argument:
dfm(tokscomp, select = dict)
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##        features
## docs    charter of rights canadian charter
##   text1                 1                1
Finally, if you had wanted principally to distinguish "Canadian charter of rights" from "Canadian charter", you could have compounded the former first and then the latter (longest to shortest is best here). But that is not exactly what you asked.
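For completeness, here is a minimal sketch of that longest-to-shortest ordering, reusing the same example sentence. The object name toks_long_first is just illustrative, and I have written it without the pipe so it does not depend on %>% being available in your session.

```r
library("quanteda")

txt <- "The Canadian charter of rights and the Canadian charter are different."
toks <- tokens(txt)

# Compound the longest phrase first, so that its words are already merged
# into one token before the shorter phrase is matched.
toks_long_first <- tokens_compound(toks, phrase("Canadian charter of rights"),
                                   concatenator = " ")

# The shorter phrase can now only match where "of rights" does not follow.
toks_long_first <- tokens_compound(toks_long_first, phrase("Canadian charter"),
                                   concatenator = " ")

print(toks_long_first)
```

Because the first call has already turned "Canadian charter of rights" into a single token, the second call should only compound the standalone "Canadian charter".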