Index for comparing to beginning of every word in a column

Question

So I have a table

id | name            | gender
---+-----------------+-------
0  | Markus Meskanen | M
1  | Jack Jackson    | M
2  | Jane Jackson    | F

And I've created an index

CREATE INDEX people_name_idx ON people (LOWER(name));

And then I query with

SELECT * FROM people WHERE name LIKE LOWER('Jack%');

Where %(name)s is the user's input. However, it now matches only to the beginning of the whole column, but I'd like it to match to the beginning of any of the words. I'd prefer not to use '%Jack%' since it would also result into invalid results from the middle of the word.

Is there a way to create an index so that each word gets a separate row?

Edit: If the name is something long like 'Michael Jackson's First Son Bob' it should match to the beginning of any of the words, i.e. Mich would match to Michael and Fir would match to First but ackson wouldn't match to anything since it's not from the beginning.

Edit 2: And we have 3 million rows so performance is an issue, thus I'm looking at indexes mostly.

the shortest change would be `lower('% Jack %')` to hit Jack in text, but using regular expressions here seems reasonable — Vao Tsun, Jan 31 '17 at 11:33
@VaoTsun That wouldn't find `Jack Jackson` since he's got no space in front of his name, or would it find `Jane Jackson` since she's got no space after her surname — Markus Meskanen, Jan 31 '17 at 11:39
@TimBiegeleisen It should match it just like first and last names — Markus Meskanen, Jan 31 '17 at 11:39
I wonder why no one mentioned yet, but the [`pg_trgm` module](https://www.postgresql.org/docs/current/static/pgtrgm.html) *is* what you are looking for. Its `gist_trgm_ops` and `gin_trgm_ops` based indexes enhances `LIKE`, `ILIKE`, `~` and `~*` queries (you don't have to create a `LOWER(...)` expression-based index for it). — pozs, Jan 31 '17 at 15:38

score 2 · Answer 1 · answered Jan 31 '17 at 11:39

2

Postgres has two index types to help with full text searches: GIN and GIST indexes (and I think GIN is the more commonly used one).

There is a brief overview of the indexes in the documentation. There is more extensive documentation for each index class, as well as plenty of blogs on the subject (here is one and here is another).

These can speed up the searches that you are trying to do.

answered Jan 31 '17 at 11:39

Gordon Linoff

1,242,037
58
646
786

Hmm this seems nice, i'll have to look further into it – Markus Meskanen Jan 31 '17 at 11:44

score 1 · Accepted Answer · edited May 23 '17 at 12:08

The pg_trgm module does exactly what you want.

You need to create either:

CREATE INDEX people_name_idx ON people USING GIST (name gist_trgm_ops);

Or:

CREATE INDEX people_name_idx ON people USING GIN (name gin_trgm_ops);

See the difference here.

After that, these queries could use one of the indexes above:

SELECT * FROM people WHERE name ILIKE '%Jack%';
SELECT * FROM people WHERE name ~* '\mJack';

As @GordonLinoff answered, full text search is also capable of searching by prefix matches. But FTS is not designed to do that efficiently, it is best in matching lexemes. Though if you want to achieve the best performace, I advise you to give it a try too & measure each. In FTS, your query looks something like this:

SELECT * FROM people WHERE to_tsvector('english', name) @@ to_tsquery('english', 'Jack:*');

Note: however if your query filter (Jack) comes from user input, both of these queries above needs some protection (i.e. in the ILIKE one you need to escape % and _ characters, in the regexp one you need to escape a lot more, and in the FTS one, well you'll need to parse the query with some parser & generate a valid FTS' tsquery query, because to_tsquery() will give you an error if its parameter is not valid. And in plainto_tsquery() you cannot use a prefix matching query).

Note 2: the regexp variant with name ~* '\mJack' will work best with english names. If you want to use the whole range of unicode (i.e. you want to use characters, like æ), you'll need a slightly different pattern. Something like:

SELECT * FROM people WHERE name ~* '(^|\s|,)Jack';

This will work with most of the names, plus this will work like a real prefix match with some old names too, like O'Brian.

McNets · Answer 3 · 2017-01-31T12:24:43.223

0

You can use Regex expressions to find text inside name:

create table ci(id int, name text);

insert into ci values
(1, 'John McEnroe Blackbird Petrus'),
(2, 'Michael Jackson and Blade');

select id, name
from ci
where name ~ 'Pe+'
;

Returns:

    1   John McEnroe Blackbird Petrus

Or can use something similar where substring(name, <regex exp>) is not null

Check it here: http://rextester.com/LHA16094

edited Jan 31 '17 at 12:24

answered Jan 31 '17 at 12:06

McNets

10,352
3
32
61

score -1 · Answer 4 · answered Jan 31 '17 at 11:34

-1

If you know that the words are space separated, You can do

SELECT * FROM people WHERE name LIKE LOWER('Jack%') or  name LIKE LOWER(' Jack%')  ;

For more control you can use RegEx with MySQl

see https://dev.mysql.com/doc/refman/5.7/en/regexp.html

answered Jan 31 '17 at 11:34

Mithilesh Gupta

2,800
1
17
17

2

The OP is using Postgres, not MySQL. – Tim Biegeleisen Jan 31 '17 at 11:35
This doesn't utilize indexing. I have 3,000,000+ rows of data, using `LIKE LOWER('% Jack%')` would be extremely slow. – Markus Meskanen Jan 31 '17 at 11:49

Index for comparing to beginning of every word in a column

4 Answers4