Efficient full text search when the language is row dependant

Question

For the sake of this question I have two tables :

CREATE TABLE room (
    id serial primary key,
    lang varchar(12) NOT NULL default 'english'
);

CREATE TABLE message (
    id bigserial primary key,
    room integer references room(id),
    content text NOT NULL
);

and I want to have a full text search with a language dependant tokenization : the tokenization of messages and their search must depend on the language of the room.

The totally unoptimized and unindexed search would be like this :

select message.id, content, lang from message, room
where message.room=room.id
and to_tsvector(lang::regconfig, content)
   @@ plainto_tsquery(lang::regconfig,'what I search')
and room=33;

A search query is always done in one unique room (so the language is homogeneous).

Now my question is how to do this efficiently ? I can't directly build an expression index as the expressions used in indexes must be "immutable" (rely only on the indexed row).

Is the creation of a new column containing to_tsvector(lang::regconfig, content) (and maintained with a trigger) the only reasonable solution if I want to have an index ?

Is that the most efficient ?

score 2 · Accepted Answer · edited May 23 '17 at 12:20

2

If you know that the association between language and room does not change, you can feed this information to Postgres by way of an IMMUTABLE function.

CREATE OR REPLACE FUNCTION room_lang(int)
RETURNS varchar(12) AS
$$
   SELECT lang FROM room WHERE id = $1
$$ LANGUAGE sql IMMUTABLE;

And use this for partial indexes:

CREATE INDEX idx_en ON message ...
WHERE room_lang(room) = 'english';

CREATE INDEX idx_es ON message ...
WHERE room_lang(room) = 'spanish';

Of course, you have to recreate any such index, if you change anything in room that breaks the promise of "immutability", thereby breaking the index ...

Use a compatible WHERE clause for your queries to let Postgres know it can use the index:

SELECT ...
WHERE room_lang(room) = 'english';

Here is a related example for indexes with an IMMUTABLE function with a lot more details:
Does PostgreSQL support "accent insensitive" collations?

Aside: I'd rather use just text instead of varchar(12).

edited May 23 '17 at 12:20

Community

1
1

answered Feb 17 '14 at 17:13

Erwin Brandstetter

605,456
145
1,078
1,228

+1 before I dive into details. First a not very important one : why `text` ? – Denys Séguret Feb 17 '14 at 17:20
@dystroy: Why text? [To avoid problems like this one.](http://stackoverflow.com/questions/8524873/change-postgresql-columns-used-in-views/8527792#8527792) Recreate: Just `DROP INDEX ...; CREATE INDEX ...`. Remember to recreate *both* indexes after a *change*. – Erwin Brandstetter Feb 17 '14 at 17:22
Won't this be rather size ineffective (I might change the type to be 2 chars long in the future) ? – Denys Séguret Feb 17 '14 at 17:24
@dystroy: The change from `varchar(12)` to `text` has no effect on size whatsoever. More info: http://stackoverflow.com/questions/13524222/table-size-with-page-layout/13525950#13525950. Also follow the link to the manual for more. – Erwin Brandstetter Feb 17 '14 at 17:26
This makes me wonder : wouldn't it be more efficient to have a partial index per room (no search would be cross-room) ? This would also prevent a room language change to force a big index recreation. – Denys Séguret Feb 17 '14 at 17:28
@dystroy: partial indexes per room are a viable alternative - if you always search per room. You'll end up with lots of partial indexes, but all of them together are hardly bigger than an index on the complete table. I have similar solutions working for me. Still, you need to hard-code the language into the expression of the index ... Here is a [related answer on dba.SE](http://dba.stackexchange.com/questions/18300/can-spatial-index-help-a-range-order-by-limit-query/22500#22500) with infrastructure to create lots of partial indexes automatically: – Erwin Brandstetter Feb 17 '14 at 17:32
@dystroy: Out of time now, gotta run. I hope you can work with this. – Erwin Brandstetter Feb 17 '14 at 17:34
I'll see that (it may take a while as it's a personal project so don't expect an accept today). – Denys Séguret Feb 17 '14 at 17:36

jean · Answer 2 · 2014-02-17T17:10:44.947

-3

In MS SQL we got Full Text Search but I don't know if Postgres got anything similar. In the case your RDBMS don't offer a solution I created one years ago. At the time we cannot activate FTS in the shared server my client rent. So I created a full customized solution.

I wrote a article with the solution at: SQL Server Central

(Obs:. you ill need do create a free account in order to see the article)

The solution was writen for MS Sql but I bet it's easily portable to Postgres.

Also posted a example at: SQL Fiddle

I hope you don't need to write a full solution like I did and hope if you need it that article can easy your pain.

Note, the final solution worked like a charm (in production) but ended a bit more sophisticated.

edited Feb 17 '14 at 17:10

answered Feb 17 '14 at 16:58

jean

4,159
4
31
52

This is a **very** database specific question. And of course there's a full text search in Postgres, read the question. – Denys Séguret Feb 17 '14 at 17:02
Read the answer, there also in MS SQL but I ended forced to solve it another way and maybe you also be forced. Also pay attention to the "hope you don't need to..." and "portable to postgres" lines. – jean Feb 17 '14 at 17:09
There's a complete solution, handling tokenization, lexemization, stop words, and so on according to the language in PG. I won't use a hand made replacement to all that. I just want to most efficiently ensure the indexation is row dependant. – Denys Séguret Feb 17 '14 at 17:13

Efficient full text search when the language is row dependant

2 Answers2