A similar question has been answered here, but since that question's title (and accepted answer) do not make the obvious link, I will show you how this applies to your question specifically. I'll also provide additional detail below to implement your own basic stemmer using wildcards for the suffixes.
Manually mapping stems to inflected forms
The simplest way to do this is to use a custom dictionary whose keys are your stems and whose values are the inflected forms. You can then use tokens_lookup() with the exclusive = FALSE, capkeys = FALSE options to convert the inflected terms into their stems.
Note that I have modified your example a little to simplify it, and to correct what I think were mistakes.
library("quanteda")
packageVersion("quanteda")
[1] ‘0.99.9’
# no need for the data.frame() call
myText <- c("ala ma kotka", "kasia ma pieska")
toks <- tokens(myText,
               remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE, remove_hyphens = TRUE)
Origin <- c("kot", "kot", "pies", "pies")
Word <- c("kotek", "kotka", "piesek", "pieska")
Then we create the dictionary, as follows. As of quanteda v0.99.9, values with the same keys are merged, so you can pass a list that maps multiple, different inflected forms to the same key. Here, I had to add new values, since the inflected forms in your original Word vector were not found in the myText example.
temp_list <- as.list(Word)
names(temp_list) <- Origin
(stem_dict <- dictionary(temp_list))
## Dictionary object with 2 key entries.
## - [kot]:
## - kotek, kotka
## - [pies]:
## - piesek, pieska
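The merging of repeated keys can also be made explicit before the dictionary is built. As a minimal sketch in base R (the name grouped is introduced here only for illustration), split() groups the inflected forms by stem into the named-list shape that dictionary() accepts:

```r
# Same vectors as above
Origin <- c("kot", "kot", "pies", "pies")
Word <- c("kotek", "kotka", "piesek", "pieska")

# Group inflected forms by stem: one list element per unique stem
grouped <- split(Word, Origin)
str(grouped)

# The grouped list could then be passed on as dictionary(grouped)
```

This avoids relying on the merge behaviour, which may matter if you are on a quanteda version older than the one shown here.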
Then tokens_lookup() does its magic.
tokens_lookup(toks, dictionary = stem_dict, exclusive = FALSE, capkeys = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "ala" "ma" "kot"
##
## text2 :
## [1] "kasia" "ma" "pies"
Wildcarding all stems from common roots
An alternative is to implement your own stemmer using "glob" wildcarding to represent all suffixes for each stem in your Origin vector, which (here, at least) produces the same results:
temp_list <- lapply(unique(Origin), paste0, "*")
names(temp_list) <- unique(Origin)
(stem_dict2 <- dictionary(temp_list))
## Dictionary object with 2 key entries.
## - [kot]:
## - kot*
## - [pies]:
## - pies*
tokens_lookup(toks, dictionary = stem_dict2, exclusive = FALSE, capkeys = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "ala" "ma" "kot"
##
## text2 :
## [1] "kasia" "ma" "pies"
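One caveat with wildcarding is that a glob such as kot* matches every token beginning with those letters, not only true inflections. The sketch below uses base R's glob2rx() and grepl() as a stand-in for quanteda's own glob matching (an assumption, since the two may differ in detail), and the token "kotlet" (cutlet) is a hypothetical addition to show an unrelated word being caught:

```r
# Hypothetical tokens: "kotlet" is unrelated to "kot" but shares the prefix
candidates <- c("kotka", "kotlet", "pieska")

# Translate the glob into a regular expression and test each token
pattern <- utils::glob2rx("kot*")
matches <- grepl(pattern, candidates)

# Tokens that the wildcard would fold into the stem "kot"
candidates[matches]
```

If this kind of over-matching is a concern for your corpus, the explicit stem-to-form mapping from the first approach is the safer choice.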