
I would like to use an external txt file with Polish lemmas, structured as follows (a source of lemmatization lists for many other languages: http://www.lexiconista.com/datasets/lemmatization/):

Abadan  Abadanem
Abadan  Abadanie
Abadan  Abadanowi
Abadan  Abadanu
abadańczyk  abadańczycy
abadańczyk  abadańczyka
abadańczyk  abadańczykach
abadańczyk  abadańczykami
abadańczyk  abadańczyki
abadańczyk  abadańczykiem
abadańczyk  abadańczykom
abadańczyk  abadańczyków
abadańczyk  abadańczykowi
abadańczyk  abadańczyku
abadanka    abadance
abadanka    abadanek
abadanka    abadanką
abadanka    abadankach
abadanka    abadankami

What packages, and with what syntax, would allow me to use such a txt database to lemmatize my bag of words? I realize that for English there is WordNet, but there is no such luck for those who would like to use this functionality for rare languages.

If not, can this database be converted to be useful with any package that provides lemmatization, perhaps by converting it to a wide form? For instance, the form used by the free AntConc concordancer (http://www.laurenceanthony.net/software/antconc/):

Abadan -> Abadanem, Abadanie, Abadanowi, Abadanu
abadańczyk -> abadańczycy, abadańczyka, abadańczykach 
etc.
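For reference, here is a minimal sketch of one way to do such a conversion with data.table, assuming the file is tab-separated and named lemmas_pl.txt (a hypothetical name):

library(data.table)

# hypothetical file name; two columns: lemma, then inflected form
dt <- fread("lemmas_pl.txt", header = FALSE, col.names = c("lemma", "word"))

# collapse all inflected forms of each lemma into one AntConc-style line
wide <- dt[, .(forms = paste(word, collapse = ", ")), by = lemma]
writeLines(paste(wide$lemma, "->", wide$forms), "lemmas_wide.txt")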

In brief: can lemmatization with lemmas from a txt file be done in any of the known CRAN R text-mining packages? If so, how should such a txt file be formatted?

UPDATE: Dear @DmitriySelivanov, I got rid of all the diacritical marks; now I would like to apply it to the tm corpus "docs":

docs <- tm_map(docs, function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")) 

and I tried it as a tokenizer:

LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")

docsTDM <-
  DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25), tokenize=LemmaTokenizer)) 

It throws an error:

 Error in lemma_hashmap[[tokens]] : 
  attempt to select more than one element in vectorIndex 

The function works like a charm on a plain character vector of texts, though.
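A likely cause, offered here as an assumption rather than a confirmed diagnosis: lemma_hashmap="lemma_hm" passes the name of the hashmap as a string rather than the hashmap object itself, so lemma_hashmap[[tokens]] subsets a length-one character vector with a multi-element index, which matches the error above. A sketch of a possible fix would pass the object directly:

# sketch of a possible fix: pass the hashmap object, not its name as a string
LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap = lemma_hm)

docsTDM <-
  DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25), tokenize = LemmaTokenizer))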

Jacek Kotowski

1 Answer


My guess is that text-mining packages have nothing to do with this task. You just need to replace each word in the second column with the word in the first column. You can do it by creating a hashmap (for example https://github.com/nathan-russell/hashmap).

Below is an example of how you can create a "lemmatizing" tokenizer, which you can easily use in text2vec (and I guess in quanteda as well).

Contributions towards creating such a "lemmatizing" package are very welcome - it would be very useful.

library(hashmap)
library(data.table)
txt = 
  "Abadan  Abadanem
  Abadan  Abadanie
  Abadan  Abadanowi
  Abadan  Abadanu
  abadańczyk  abadańczycy
  abadańczyk  abadańczykach
  abadańczyk  abadańczykami
  "
# read the lemma/word pairs and build an inflected-form -> lemma hashmap
dt = fread(txt, header = FALSE, col.names = c("lemma", "word"))
lemma_hm = hashmap(dt$word, dt$lemma)

lemma_hm[["Abadanu"]]
#"Abadan"


lemma_tokenizer = function(x, lemma_hashmap, 
                           tokenizer = text2vec::word_tokenizer) {
  tokens_list = tokenizer(x)
  for(i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    # vectorized hashmap lookup; NA for out-of-vocabulary tokens
    replacements = lemma_hashmap[[tokens]]
    # replace only the tokens that have a known lemma
    ind = !is.na(replacements)
    tokens_list[[i]][ind] = replacements[ind]
  }
  tokens_list
}
texts = c("Abadanowi abadańczykach OutOfVocabulary", 
          "abadańczyk Abadan OutOfVocabulary")
lemma_tokenizer(texts, lemma_hm)

#[[1]]
#[1] "Abadan"          "abadańczyk"      "OutOfVocabulary"
#[[2]]
#[1] "abadańczyk"      "Abadan"          "OutOfVocabulary"
Dmitriy Selivanov
  • Dear @DmitriySelivanov, it seems not to work with diacritics - Latin-2 letters, as in Czech or Polish. The text is in UTF-8 and the dictionary is UTF-8; is it the hash table that causes problems? It will not work with words like żółći, żółcią. – Jacek Kotowski Sep 08 '17 at 11:51
  • Dear @DmitriySelivanov, I got rid of all the diacritical marks; now I would like to apply it to the tm corpus "docs". I have added an UPDATE to my question for clearer formatting. – Jacek Kotowski Sep 08 '17 at 14:26
  • Sorry, I have no idea why it doesn't work with `tm` and don't want to dig into it. Maybe it's worth opening a separate question, and other people will help. – Dmitriy Selivanov Sep 08 '17 at 16:20
  • Ok Dmitriy, you helped me a lot. I will try with text2vec. – Jacek Kotowski Sep 08 '17 at 18:07