In lunr, a stemmer is implemented as a pipeline function. A pipeline function is executed against each word in a document when indexing the document, and against each word in a search query when searching.
For a function to work in a pipeline it has to implement a very simple interface: it needs to accept a single string as input, and it must return a string as its output.
So a very simple (and useless) pipeline function would look like the following:
var simplePipelineFunction = function (word) {
  return word
}
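To make "executed against each word" concrete, here is a toy model of a pipeline in plain JavaScript. It is only a sketch for illustration, not lunr's actual implementation; `runPipeline` and `lowerCase` are hypothetical names made up for this example.

```javascript
// The no-op pipeline function from above.
var simplePipelineFunction = function (word) {
  return word
}

// A hypothetical second pipeline function, just so the pipeline does something.
var lowerCase = function (word) {
  return word.toLowerCase()
}

// Toy model (NOT lunr's code): pass every word through each function in order,
// feeding the output of one function into the next.
var runPipeline = function (fns, words) {
  return words.map(function (word) {
    return fns.reduce(function (current, fn) {
      return fn(current)
    }, word)
  })
}

var result = runPipeline([simplePipelineFunction, lowerCase], ['Hello', 'World'])
```

Each function only ever sees one string and hands back one string, which is why a stemmer slots in so easily.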
To actually make use of this pipeline function we need to do two things:
- Register it as a pipeline function. This allows lunr to correctly serialise and deserialise your pipeline.
- Add it to your index's pipeline.
That would look something like this:
// registering our pipeline function with the name 'simplePipelineFunction'
lunr.Pipeline.registerFunction(simplePipelineFunction, 'simplePipelineFunction')
var idx = lunr(function () {
  // adding the pipeline function to our index's pipeline
  // when defining the index
  this.pipeline.add(simplePipelineFunction)
})
Now you can take the above and swap out the implementation of our pipeline function. Instead of just returning the word unchanged, it could use the Greek stemmer you have found to stem the word, maybe like this:
var greekStemmer = function (word) {
  // I don't know how to use the Greek stemmer, but I think
  // it's safe to assume it won't be that different from this
  return greekStem(word)
}
Adapting lunr to work with a language other than English requires more than just adding your stemmer though. The default language of lunr is English, and so, by default, it includes pipeline functions that are specialised for English. English and Greek are different enough that you will probably run into issues trying to index Greek words with the English defaults, so we need to do the following:
- Replace the default stemmer with our language-specific stemmer.
- Remove the default trimmer, which doesn't play nicely with non-Latin characters.
- Replace or remove the default stop word filter; it's unlikely to be much use for a language other than English.
The trimmer and stop word filter are also implemented as pipeline functions, so implementing language-specific ones would be similar to implementing the stemmer.
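As a rough sketch of what Greek versions of those two functions might look like: the names `greekTrimmer` and `greekStopWordFilter` match the ones used in the setup further down, but the implementations are my own assumptions, and the stop word list is a tiny illustrative sample rather than a complete one. The trimmer relies on Unicode property escapes (`\p{L}`, `\p{N}`), which need a reasonably modern JavaScript engine. The filter relies on my understanding that lunr drops a token when a pipeline function returns `undefined`, which is worth checking against the version you use.

```javascript
// Strip anything that is not a letter or digit (in any script) from both
// ends of the word, leaving the middle of the word untouched.
var greekTrimmer = function (word) {
  return word.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '')
}

// Illustrative sample only; a real list would be much longer.
var greekStopWords = { 'και': true, 'το': true, 'η': true, 'ο': true }

// Return undefined for stop words (assumed to make lunr drop the token),
// and pass every other word through unchanged.
var greekStopWordFilter = function (word) {
  if (Object.prototype.hasOwnProperty.call(greekStopWords, word)) return undefined
  return word
}
```

Both functions would still need registering with lunr.Pipeline.registerFunction, in the same way as the simple pipeline function earlier.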
So, to set up lunr for Greek you would have this:
var idx = lunr(function () {
  // the pipeline has no replace function, so insert each replacement
  // after the default it replaces, then remove the default
  this.pipeline.after(lunr.stemmer, greekStemmer)
  this.pipeline.remove(lunr.stemmer)
  this.pipeline.after(lunr.trimmer, greekTrimmer)
  this.pipeline.remove(lunr.trimmer)
  this.pipeline.after(lunr.stopWordFilter, greekStopWordFilter)
  this.pipeline.remove(lunr.stopWordFilter)
  // define the index as normal
  this.ref('id')
  this.field('title')
  this.field('body')
})
For some more inspiration, take a look at the excellent lunr-languages project, which has many examples of creating language extensions for lunr. You could even submit one for Greek!
EDIT: Looks like I don't know the lunr.Pipeline API as well as I thought. There is no replace function; instead, we insert the replacement after the function we want to remove, and then remove the original.
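If you find yourself repeating that insert-then-remove step, it could be wrapped in a small helper. `replaceInPipeline` is a hypothetical name of my own, not part of lunr; it relies only on the pipeline's `after` and `remove` methods used in the setup code.

```javascript
// Hypothetical helper (not part of lunr): swap oldFn for newFn in place by
// inserting newFn directly after oldFn, then removing oldFn.
var replaceInPipeline = function (pipeline, oldFn, newFn) {
  pipeline.after(oldFn, newFn)
  pipeline.remove(oldFn)
}

// Inside the index definition this would be used as, for example:
//   replaceInPipeline(this.pipeline, lunr.stemmer, greekStemmer)
```

The replacement ends up in the same position the original occupied, so the order of the rest of the pipeline is preserved.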
EDIT: Adding this to help others in the future. It turns out the problem was down to the casing of the tokens within lunr. lunr treats all tokens as lowercase; the tokenizer lowercases them, and there is no way to configure this. For most language processing functions this is not a problem; indeed, most assume words are lowercased. In this case, however, the Greek stemmer only stems uppercase words, due to the complexity of stemming in Greek (I'm not a Greek speaker, so I can't comment on how much more complex it is). A solution is to convert each token to uppercase before calling the Greek stemmer, then convert the result back to lowercase before passing it on to the rest of the pipeline.
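That workaround might look roughly like the following. `greekStem` stands in for whatever function the Greek stemmer library actually exposes (I'm assuming it takes a string and returns a string), and `makeUpperCaseStemmer` is a hypothetical helper name of my own.

```javascript
// Wrap a stemmer that only works on uppercase input so that it can sit in a
// pipeline where every token is lowercase.
var makeUpperCaseStemmer = function (stem) {
  return function (word) {
    // lunr hands us a lowercased token; the stemmer expects uppercase
    var stemmed = stem(word.toUpperCase())
    // lowercase the result so it matches the rest of lunr's tokens
    return stemmed.toLowerCase()
  }
}

// For example (greekStem is assumed, see above):
//   var greekStemmer = makeUpperCaseStemmer(greekStem)
```

The wrapped function is the one you would register and add to the pipeline in place of the raw stemmer.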