Scroll to the end to skip the explanation.
Background
In my Android app, I want to use non-English Unicode text strings to search for matches in text documents/fields that are stored in a SQLite database. I had learned (or so I thought) that what I need is a full text search with fts3/fts4, so that is what I have been studying for the past couple of days. FTS is supported by Android, as shown in the documentation Storing and Searching for Data and in the blog post Android Quick Tip: Using SQLite FTS Tables.
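For concreteness, here is a minimal sketch of the kind of setup I mean, following the pattern in those two references; the database, table, and column names are just placeholders I made up:

```java
import android.content.Context;
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;

// Hypothetical helper illustrating the basic FTS4 setup described in the
// Android docs; "docs_fts" and "body" are placeholder names.
public class DocumentDbHelper extends SQLiteOpenHelper {
    private static final String DB_NAME = "documents.db";
    private static final int DB_VERSION = 1;
    static final String FTS_TABLE = "docs_fts";
    static final String COL_BODY = "body";

    public DocumentDbHelper(Context context) {
        super(context, DB_NAME, null, DB_VERSION);
    }

    @Override
    public void onCreate(SQLiteDatabase db) {
        // FTS4 virtual table; with no tokenize= option this uses the
        // default "simple" tokenizer.
        db.execSQL("CREATE VIRTUAL TABLE " + FTS_TABLE
                + " USING fts4(" + COL_BODY + ")");
    }

    @Override
    public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
        db.execSQL("DROP TABLE IF EXISTS " + FTS_TABLE);
        onCreate(db);
    }

    // Full text search with MATCH; the query string is passed as a
    // bound argument.
    public Cursor search(String query) {
        SQLiteDatabase db = getReadableDatabase();
        return db.rawQuery(
                "SELECT rowid, " + COL_BODY + " FROM " + FTS_TABLE
                        + " WHERE " + COL_BODY + " MATCH ?",
                new String[]{query});
    }
}
```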
Problem
Everything was looking good, but then I read the March 2012 blog post The sorry state of SQLite full text search on Android, which said:
The first step when building a full text search index is to break down the textual content into words, aka tokens. Those tokens are then entered into a special index which lets SQLite perform very fast searches based on a token (or a set of tokens).
SQLite has two built-in tokenizers, and they both only consider tokens consisting of US ASCII characters. All other, non-US ASCII characters are considered whitespace.
After that I also found this StackOverflow answer by @CL. (who, based on tags and reputation, appears to be an expert on SQLite) replying to a question about matching Vietnamese letters with different diacritics:
You must create the FTS table with a tokenizer that can handle Unicode characters, i.e., ICU or UNICODE61.
Please note that these tokenizers might not be available on all Android versions, and that the Android API does not expose any functions for adding user-defined tokenizers.
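If I understand that correctly, selecting a different tokenizer is just a `tokenize=` option in the CREATE VIRTUAL TABLE statement, and the risk is that the statement fails on devices whose SQLite build does not include that tokenizer. Something like this sketch (again with my placeholder names), falling back to the default tokenizer when `unicode61` is missing:

```java
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteException;

public final class FtsTableFactory {
    private FtsTableFactory() {}

    // Tries unicode61 first; if the device's SQLite build does not know
    // that tokenizer, the CREATE statement throws and we fall back to
    // the default "simple" tokenizer.
    public static void createFtsTable(SQLiteDatabase db) {
        try {
            db.execSQL("CREATE VIRTUAL TABLE docs_fts USING fts4("
                    + "body, tokenize=unicode61)");
        } catch (SQLiteException e) {
            db.execSQL("CREATE VIRTUAL TABLE docs_fts USING fts4(body)");
        }
    }
}
```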
This 2011 SO answer seems to confirm that Android does not support tokenizers beyond the two basic `simple` and `porter` ones.
It is now 2015. Has anything changed? I need full text search to work for everyone using my app, not just people with new phones (even if the newest Android version does support it now).
Potential partial solution?
I find it hard to believe that FTS does not work at all with Unicode. The documentation for the `simple` tokenizer says:
A term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters and *all characters with Unicode codepoint values greater than or equal to 128*. All other characters are discarded when splitting a document into terms. Their only contribution is to separate adjacent terms. (emphasis added)
That gives me hope that some basic Unicode functionality could still work on Android, even if case folding, diacritics, and other equivalent letter forms that have different Unicode code points are not handled.
My Main Question
Can I use SQLite FTS in Android with non-English Unicode text (codepoints > 128) if I am only using literal Unicode string tokens separated by spaces? (That is, I am searching for exact strings that occur in the text.)
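To make the question concrete, this is the kind of query I have in mind (placeholder names as above); the double quotes turn the argument into an FTS phrase query, so the exact space-separated string is looked up:

```java
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

public final class FtsSearch {
    private FtsSearch() {}

    // Looks up rows whose body contains the exact space-separated phrase.
    // Wrapping the argument in double quotes makes MATCH treat it as a
    // phrase query rather than parsing FTS query syntax out of it.
    public static Cursor searchExactPhrase(SQLiteDatabase db, String phrase) {
        return db.rawQuery(
                "SELECT rowid, body FROM docs_fts WHERE body MATCH ?",
                new String[]{"\"" + phrase + "\""});
    }
}
```

If the `simple` tokenizer really does index characters with codepoints ≥ 128 as written, then calling `searchExactPhrase()` with a non-ASCII phrase should only match rows containing that exact character sequence, with no case folding or diacritic equivalence. That would be enough for my use case.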
Updates
- The unicode61 tokenizer has been available since SQLite version 3.7.13. This tokenizer supports "full unicode case folding" and "recognizes unicode space and punctuation characters." Android Lollipop (API 21+) ships with SQLite 3.8.
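If that version information is right, another option would be to pick the tokenizer at creation time based on the API level; a sketch along the same lines as the fallback above (same placeholder names, and the cutoff is my reading of the version table, not something I have verified on every device):

```java
import android.database.sqlite.SQLiteDatabase;
import android.os.Build;

public final class FtsSchema {
    private FtsSchema() {}

    // Chooses the tokenizer by API level: Lollipop (API 21) ships an
    // SQLite build recent enough for unicode61, while older devices get
    // the default "simple" tokenizer.
    public static void createFtsTable(SQLiteDatabase db) {
        String tokenize = Build.VERSION.SDK_INT >= Build.VERSION_CODES.LOLLIPOP
                ? ", tokenize=unicode61"
                : "";
        db.execSQL("CREATE VIRTUAL TABLE docs_fts USING fts4(body" + tokenize + ")");
    }
}
```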