2

I have a postgresql database with a users table, where each user has a name (in unicode). I'd like to find all users whose name contains at least one Hebrew character. I've thought using regex, e.g.

select * from users
where name ~ '[א-ת]';

Is there a more efficient approach than the regex? I have a B-tree index on the names column.

Update

Using the different indices with the pg_trgm module as suggested by @FuzzyTree

      B-tree GIST  GIN
user  0.04   0.04  0.03
sys   0.02   0.04  0.01
total 0.06   0.08  0.04

Regarding disk size, the GIN index is 0.2x of the GIST one, and 0.8x of the B-tree. So, we have a winner here, at least for my use case. YMMV (e.g. I haven't benchmarked index creation and update). Version: postgres 9.6.

dimid
  • 7,285
  • 1
  • 46
  • 85
  • 1
    Come on, with more than 2k rep, you should already what we want and that you are likely to get more useful answers when you provide some code snippets. – Jan Oct 01 '17 at 18:27
  • @Jan, a snippet for what? I'm asking *what* to do, rather than *how* to do something. I can add a snippet for my current regex code if that helps. – dimid Oct 01 '17 at 18:35

1 Answers1

1

One option is to create a boolean column i.e. is_hebrew_name that you can update once using your regex and create a regular index on.

If you don't want to add another column and you're running v9.3 or higher, then consider using the pg_trgm module to create a GIN or GIST index on name

CREATE EXTENSION pg_trgm;
CREATE INDEX trgm_idx ON users USING GIST (name gist_trgm_ops);

The pg_trgm module provides GiST and GIN index operator classes that allow you to create an index over a text column for the purpose of very fast similarity searches. These index types support the above-described similarity operators, and additionally support trigram-based index searches for LIKE, ILIKE, ~ and ~* queries.

The index search works by extracting trigrams from the regular expression and then looking these up in the index. The more trigrams that can be extracted from the regular expression, the more effective the index search is. Unlike B-tree based searches, the search string need not be left-anchored.

For both LIKE and regular-expression searches, keep in mind that a pattern with no extractable trigrams will degenerate to a full-index scan.

The choice between GiST and GIN indexing depends on the relative performance characteristics of GiST and GIN, which are discussed elsewhere.

For more info see https://www.postgresql.org/docs/9.6/static/pgtrgm.html

dimid
  • 7,285
  • 1
  • 46
  • 85
FuzzyTree
  • 32,014
  • 3
  • 54
  • 85
  • Thanks, I'll [test](https://stackoverflow.com/questions/21830/postgresql-gin-or-gist-indexes#26398) both and will report the result. – dimid Oct 01 '17 at 22:54