How to implement a custom spell check in the search API of GAE

Question

In my Python GAE application, users query items through the Search API. I initially indexed the documents with their exact tags, but without spell correction there are few hits. The approach I found was to implement character n-grams in the Datastore, since this still produces a match as long as the user types at least part of the word correctly. On the Datastore this takes a lot of time. For example,

"hello" (is broken into) ["hello", "ello", "hell", "hel", "ell", "llo", "he", "el", "ll", "lo"]

and when I search for "helo", it is broken into ["hel", "elo", "he", "el", "lo"], and the tags it shares with "hello" give a positive match.

I rank them according to the length of the tags matched from a word.
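The scheme described above can be sketched in plain Python, independent of the Datastore. This is only an illustration: the function names are hypothetical, and summing the lengths of the shared n-grams is one assumed reading of "rank by the length of the tags matched".

```python
def char_ngrams(word, min_len=2):
    """Return every substring of `word` with at least `min_len` characters."""
    word = word.lower()
    return {
        word[start:start + size]
        for size in range(min_len, len(word) + 1)
        for start in range(len(word) - size + 1)
    }


def match_score(query, candidate, min_len=2):
    """Sum the lengths of the n-grams shared by `query` and `candidate`.

    Assumption: longer shared n-grams should count for more.
    """
    shared = char_ngrams(query, min_len) & char_ngrams(candidate, min_len)
    return sum(len(gram) for gram in shared)


def rank_candidates(query, candidates, min_len=2):
    """Order candidates by descending score, dropping those with no overlap."""
    scored = [(match_score(query, word, min_len), word) for word in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored if score > 0]


ranked = rank_candidates("helo", ["world", "hello", "help"])
```

Here "helo" shares "he", "el", "lo" and "hel" with "hello", so "hello" is ranked ahead of "help", and "world" is dropped because it shares nothing.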

On the Datastore, I have to index these character n-grams separately, along with the entities they match, and for each word perform the search on every tag in a similar manner, which takes a lot of time.

Is there a way of achieving a similar operation using the Search API? Does the MatchScorer look across multiple fields combined with "OR"? I am looking for ways to design the search documents and perform multiple spell-corrected queries in minimal operations.

If I have multiple fields for languages in each document, for example:

[tags - "hello world"] [rank - 2300] [partial tags - "hel", "ell", "llo", "wor", "orl", "rld", "hell", "ello", "worl", "orld"] [English - 1] [Spanish - 0] [French - 0] [German - 0]

Can I perform a MatchScorer operation along with a sort on the language fields? (Each document is associated with only one language.)

score 2 · Answer 1 · edited May 23 '17 at 10:30

Search API is a good service for this and much better suited than the Datastore. If your search documents have the correct language set, Search API will cover certain language-specific variations (e.g. singular / plural). But Search API only matches on whole words (typically separated by spaces, hyphens, dots etc.).

UPDATE: Language is defined either in the language property of a field, or in the language property of the entire document. In either case, the value is a two-letter ISO 639-1 language code, for example 'de' for German.

For tokenizing search terms ("hel", "elo",...), you can use the pattern from this answer: https://stackoverflow.com/a/13171181/1549523 Also see my comment to that answer. If you want to enforce a minimum token length (e.g. only 3+ letters) to reduce storage size and frontend instance time, you can use the code I've linked there.
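As a rough illustration of tokenizing with a minimum length, here is a minimal sketch in plain Python. It is not the code from the linked answer; the function name `tokenize` and its `min_len` / `max_len` parameters are assumptions made for this example.

```python
def tokenize(phrase, min_len=3, max_len=None):
    """Return the distinct substrings of each word in `phrase`.

    Tokens are listed in order of first appearance. `max_len` is optional
    and caps the token length (None means up to the full word).
    """
    tokens = []
    seen = set()
    for word in phrase.lower().split():
        longest = len(word) if max_len is None else min(max_len, len(word))
        for size in range(min_len, longest + 1):
            for start in range(len(word) - size + 1):
                token = word[start:start + size]
                if token not in seen:
                    seen.add(token)
                    tokens.append(token)
    return tokens


# A space-separated string that could be stored in one text field of a
# search document, next to the field holding the exact tags.
partial_tags = " ".join(tokenize("hello world", min_len=3, max_len=4))
```

With `min_len=3` and `max_len=4`, "hello world" yields the same ten partial tags as the sample document in the question. Raising `min_len` shrinks the stored field at the cost of missing very short fragments.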

MatchScorer is helpful to weight the frequency of a given term in a document. Since tags typically occur only once per document, it wouldn't help you with that. But for example, if your search is about searching in research papers for the term "combustion", MatchScorer would rank the results, showing first the papers that have the term included most often.

Faceted search would add so-called facets to the result of your search query, i.e. (by default) the 10 most frequently occurring facets for the current query are returned, too. This is helpful with tags or categories, so users can drill down their search by applying any of these suggested filters.

If you want to suggest the correctly spelled search term to users, it might make sense to use two indices: one index, the primary index, for your actual search documents (e.g. product descriptions with tags), and a second index just for tags or categories (tokenized, and possibly with synonyms). When your user types into a search field, your app first queries the tag index, suggesting matching tags. If the user selects one of them, the tag is used to query the primary search index. This would help users pick correct tags.

Those tags could be managed in the datastore of course, including their synonyms, if there are people maintaining such lists. And every time a tag is stored, your app updates the corresponding search document (in the secondary index) including all the character ngrams (tokens).
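The two-index flow can be sketched with in-memory stand-ins. This does not call the Search API: the classes `TagIndex` and `PrimaryIndex` and all their methods are hypothetical, and plain dictionaries take the place of the two search indices, purely to show the order of operations.

```python
from collections import defaultdict


def substrings(word, min_len=3):
    """Return all substrings of `word` with at least `min_len` characters."""
    word = word.lower()
    return {word[i:i + n]
            for n in range(min_len, len(word) + 1)
            for i in range(len(word) - n + 1)}


class TagIndex:
    """In-memory stand-in for the secondary (tag) index."""

    def __init__(self):
        self._tags_by_token = defaultdict(set)

    def put(self, tag, synonyms=()):
        # Index the tag under its own tokens and those of its synonyms.
        for term in (tag,) + tuple(synonyms):
            for word in term.split():
                for token in substrings(word):
                    self._tags_by_token[token].add(tag)

    def suggest(self, typed, limit=5):
        # Count how many tokens of the typed text each tag shares.
        hits = defaultdict(int)
        for word in typed.split():
            for token in substrings(word):
                for tag in self._tags_by_token.get(token, ()):
                    hits[tag] += 1
        ranked = sorted(hits, key=lambda tag: (-hits[tag], tag))
        return ranked[:limit]


class PrimaryIndex:
    """In-memory stand-in for the primary index of searchable documents."""

    def __init__(self):
        self._doc_ids_by_tag = defaultdict(set)

    def put(self, doc_id, tags):
        for tag in tags:
            self._doc_ids_by_tag[tag].add(doc_id)

    def search(self, tag):
        return sorted(self._doc_ids_by_tag.get(tag, ()))


tag_index = TagIndex()
tag_index.put("hello", synonyms=("greeting",))
tag_index.put("help")

primary_index = PrimaryIndex()
primary_index.put("doc-1", ["hello"])
primary_index.put("doc-2", ["help"])

suggestions = tag_index.suggest("helo")             # step 1: query the tag index
results = primary_index.search(suggestions[0])      # step 2: query by the chosen tag
```

In a real app the user would pick one of the suggestions; taking the first one here only keeps the example short.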

I tried the 3+ and 4+ variations and it works sluggishly, mainly because I am not able to demarcate the languages. Say I have Spanish, Italian, German, French and English users. When a Spanish user searches for something, I want them to see Spanish results that match the query first, and then maybe English or other languages, rather than a strict filter on Spanish (as is the case with "type" in the Search API). Is there a way to use MatchScorer along with a sort on language fields that hold a binary value for each document? - minocha, Oct 16 '15 at 05:36
@minocha I have added a note on how to mark the language of a field or document, after I noticed how you notated the sample fields in your question. Since I have only worked with language-agnostic indexing in my apps (German audience only), I'm not sure how Search API will treat results in different languages, but I believe it already behaves the way you want. AFAIU, Search API will guess the language of a search term, then apply language-specific rules for improved matching, but I don't expect Search API to ignore fields or documents in general only because of a different language. - Ani, Oct 16 '15 at 08:24
