0

I have a PostgreSQL table of Czech words (>1M rows) with a column named "word" [text] and I want to find all words with the same declination (see Czech declination) based on the word ending.

E.g. I want to find all words that end with "e" (e.g. kuře), but for which also exists word forms that end with "ete" (e.g. kuřete) and also "etem" (e.g. kuřetem) and also "eti" (e.g. kuřeti). For each word exists approx. 14 word forms.

What is an efficient way (SQL query) to find all words that match the rule?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
dev0experiment
  • 462
  • 1
  • 7
  • 22

2 Answers2

2

This is a case of relational division.

Assuming a table of UNIQUE words like:

CREATE TABLE words (word text PRIMARY KEY);

This should be among the fastest possible solutions:

SELECT w0.stem
FROM  (
   SELECT left(word, -4) AS stem  -- -4 = length('etem')
   FROM   words
   WHERE  word LIKE '%etem'  -- pick the most selective ending to get started
   ) w0
JOIN   words w1 ON w1.word = stem || 'eti'
JOIN   words w2 ON w2.word = stem || 'ete'
JOIN   words w3 ON w3.word = stem || 'e';

Finds all word stems that appear with all the given endings. More words starting with the same stem and different endings do not disqualify!

If you have to check for many endings (14?), it may get tedious to spell it all out. Shorter code, typically slower:

SELECT w0.stem
FROM  (
   SELECT left(word, -4) AS stem
   FROM   words
   WHERE  word LIKE '%etem'  -- pick the most selective ending to get started
   ) w0
CROSS  JOIN unnest ('{eti,ete,e}'::text[]) x(dec)  -- all other in an array
JOIN   words w1 ON w1.word = w0.stem || x.dec
GROUP  BY w0.stem
HAVING count(*) = 3;  -- = cardinality('{eti,ete,e}'::text[])

db<>fiddle here

Related:

Text search operators and indexes might be of interest. But you need a Czech stemmer first, which is not included in the standard Postgres distribution. Related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
0

Hmmm . . . if I understand correctly, then this is not that efficient an operation. But I think that aggregation might be the fastest approach unless you dive into arcane indexing strategies (which even then might not work):

select left(word, length(word) - 1) || 'e'
from words w
where word ~ '(e|ete|etem)$'
group by left(word, length(word) - 1);
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786