
I have data like this (simplified):

library(quanteda)

# sample data

myText <- c("ala ma kotka", "kasia ma pieska")  
myDF <- data.frame(myText)
myDF$myText <- as.character(myDF$myText)

# tokenization

tokens <- tokens(myDF$myText, what = "word",
                 remove_numbers = TRUE, remove_punct = TRUE,
                 remove_symbols = TRUE, remove_hyphens = TRUE)

# stemming with my own sample dictionary

Origin <- c("kot", "pies")
Word <- c("kotek","piesek")

myDict <- data.frame(Origin, Word)

myDict$Origin <- as.character(myDict$Origin)
myDict$Word <- as.character(myDict$Word)

# what I get:

tokens[1]
[1] "Ala"   "ma"    "kotka"

# what I would like to get:

tokens[1]
[1] "Ala"   "ma"    "kot"
tokens[2]
[1] "Kasia"   "ma"    "pies"
Garf

1 Answer


A similar question has been answered here, but since that question's title (and accepted answer) do not make the obvious link, I will show you how this applies to your question specifically. I'll also provide additional detail below to implement your own basic stemmer using wildcards for the suffixes.

Manually mapping stems to inflected forms

The simplest way to do this is by using a custom dictionary where the keys are your stems, and the values are the inflected forms. You can then use tokens_lookup() with the exclusive = FALSE, capkeys = FALSE options to convert the inflected terms into their stems.

Note that I have modified your example a little to simplify it, and to correct what I think were mistakes.

library("quanteda")
packageVersion("quanteda")
## [1] ‘0.99.9’

# no need for the data.frame() call
myText <- c("ala ma kotka", "kasia ma pieska")  
toks <- tokens(myText, 
               remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE, remove_hyphens = TRUE)

Origin <- c("kot", "kot", "pies", "pies")
Word <- c("kotek", "kotka", "piesek", "pieska")

Then we create the dictionary, as follows. As of quanteda v0.99.9, values with the same keys are merged, so you could have a list mapping multiple, different inflected forms to the same keys. Here, I had to add new values since the inflected forms in your original Word vector were not found in the myText example.

temp_list <- as.list(Word) 
names(temp_list) <- Origin
(stem_dict <- dictionary(temp_list))
## Dictionary object with 2 key entries.
## - [kot]:
##   - kotek, kotka
## - [pies]:
##   - piesek, pieska    

Then tokens_lookup() does its magic.

tokens_lookup(toks, dictionary = stem_dict, exclusive = FALSE, capkeys = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "ala" "ma"  "kot"
## 
## text2 :
## [1] "kasia" "ma"    "pies" 
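If you prefer to keep the stem-to-inflection mapping in a two-column data.frame, as in the original question's myDict, a minimal sketch (assuming the same Origin/Word columns) is to use split() to group the inflected forms by stem, which yields exactly the named list that dictionary() expects:

```r
library("quanteda")

# the question's mapping as a data.frame, with one row per inflected form
myDict <- data.frame(
  Origin = c("kot", "kot", "pies", "pies"),
  Word   = c("kotek", "kotka", "piesek", "pieska"),
  stringsAsFactors = FALSE
)

# split() groups Word by Origin into a named list of character vectors,
# so no manual as.list()/names() step is needed
stem_dict <- dictionary(split(myDict$Word, myDict$Origin))

toks <- tokens(c("ala ma kotka", "kasia ma pieska"))
tokens_lookup(toks, dictionary = stem_dict, exclusive = FALSE, capkeys = FALSE)
```

This produces the same dictionary (and the same lookup result) as building the list manually.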

Wildcarding all stems from common roots

An alternative is to implement your own stemmer using the "glob" wildcarding to represent all suffixes for your Origin vector, which (here, at least) produces the same results:

temp_list <- lapply(unique(Origin), paste0, "*")
names(temp_list) <- unique(Origin)
(stem_dict2 <- dictionary(temp_list))
## Dictionary object with 2 key entries.
## - [kot]:
##   - kot*
## - [pies]:
##   - pies*

tokens_lookup(toks, dictionary = stem_dict2, exclusive = FALSE, capkeys = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "ala" "ma"  "kot"
## 
## text2 :
## [1] "kasia" "ma"    "pies" 
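One caveat with the wildcard approach: a glob pattern such as "kot*" matches any token that begins with that prefix, not only true inflections of the stem. A small illustrative sketch (using "kotlet", Polish for "cutlet", as a hypothetical unrelated word sharing the prefix):

```r
library("quanteda")

# glob stems match by prefix, so unrelated tokens can be rewritten too
stem_dict2 <- dictionary(list(kot = "kot*", pies = "pies*"))

# "kotlet" is unrelated to "kot" but starts with the same letters,
# so tokens_lookup() will replace it with the key "kot" as well
toks_extra <- tokens("ala ma kotlet")
tokens_lookup(toks_extra, dictionary = stem_dict2,
              exclusive = FALSE, capkeys = FALSE)
```

If over-matching like this is a concern for your vocabulary, the explicit key-to-values dictionary from the first approach is the safer choice.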
Ken Benoit
    I would like to say a very big "thank you" for your complete answer and time taken to explain it perfectly. – Garf Sep 28 '17 at 08:04