1

I am refining a Search Auto-suggestion system, and after a few iterations and Normalization, have the following set of tables:

DB Fiddle: https://www.db-fiddle.com/f/b1FvGDkBMQXkREaMh3pHKi/0

Table 1: keywords - It stores a list of alphanumeric (relevant) keywords. Relevant details are:

CREATE TABLE keywords
  (
     keyword_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
     keyword    VARCHAR(32) NOT NULL
    /* Other fields eg: ranking, weights, etc */
  ) engine = innodb;

| keyword_id | keyword |
| ---------- | ------- |
| 1          | cotton  |
| 2          | jeans   |
| 3          | suit    |

Table 2: phrases - It represents a specific phrase. Earlier, I used to store the complete phrase text (with a UNIQUE constraint on the phrase), in this table. However, due to certain application related issues, data inconsistency happened. Also, this was duplication of data, hence I normalized it further:

CREATE TABLE phrases
  (
     phrase_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
    /* Other fields eg: ranking, weights, category details etc */
  ) engine = innodb;

| phrase_id |
| --------- |
| 1         | /* Earlier, a column phrase_text = "cotton jeans" existed as well */
| 2         | /* Earlier, a column phrase_text = "cotton suit" existed as well */
| 3         | /* This also represents "cotton jeans". Cant enforce uniqueness*/

Table 3: keywords_to_phrases - It is a mapping table; a phrase is single-space separated concatenation of the keywords, sorted by their position. Table details:

CREATE TABLE keywords_to_phrases
  (
    phrase_id INT UNSIGNED NOT NULL,
    keyword_id INT UNSIGNED NOT NULL,
    position TINYINT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (phrase_id, position),
    FOREIGN KEY (phrase_id) REFERENCES phrases(phrase_id),
    FOREIGN KEY (keyword_id) REFERENCES keywords(keyword_id)
  ) engine = innodb;

| phrase_id | keyword_id | position |
| --------- | ---------- | -------- |
| 1         | 1          | 0        |
| 1         | 2          | 1        |
| 2         | 1          | 0        |
| 2         | 3          | 1        |
| 3         | 1          | 0        |
| 3         | 2          | 1        |

So, in order to get a phrase text, the query would be:

SELECT p.phrase_id, 
       GROUP_CONCAT(k.keyword ORDER BY kp.position ASC
                    SEPARATOR ' ') AS phrase_text
FROM phrases p 
JOIN keywords_to_phrases kp ON kp.phrase_id = p.phrase_id 
JOIN keywords k ON k.keyword_id = kp.keyword_id 
GROUP BY p.phrase_id;

| phrase_id | phrase_text  |
| --------- | ------------ |
| 1         | cotton jeans |
| 2         | cotton suit  |
| 3         | cotton jeans |  /* Duplicate phrase text - cant enforce uniqueness */
  • The Primary Key on (phrase_id, position), ensures that a position is not repeated for the phrase.
  • This design also allows to avoid duplication of keyword(s) in a phrase. It can be achieved by UNIQUE constraint on (phrase_id, keyword_id).

However, I am unable to enforce the uniqueness of a phrase text. How can I ensure that no other phrase_id exists which will have the exact same set of (keyword_id, position) rows.

In the example above: phrase_id = 1 and 3 are duplicates. I could store them as JSON instead, and apply UNIQUE constraint on the JSON. But then, I am unable to index the JSON properly, and cannot utilize the Foreign Key constraint on keyword_id.

So, the question is: In MySQL, is there a way (smart trick) to achieve this unique constraint across the combination of rows (other than triggers), in the table design itself ?

I am also open to better table structure suggestions (if any; maybe I have over-normalized this !).

Madhur Bhaiya
  • 28,155
  • 10
  • 49
  • 57
  • Can you post only minimal code/question ?? – Prabhjot Singh Kainth Dec 27 '19 at 10:11
  • @PrabhjotSinghKainth what part do you feel is excess ? Table structure fiddle is here: https://www.db-fiddle.com/f/b1FvGDkBMQXkREaMh3pHKi/0 It is a design problem, and I thought that some explanation would be helpful. – Madhur Bhaiya Dec 27 '19 at 10:13
  • 1
    Don't be afraid of triggers. The pitiful small set of declarative SQL constraints implemented by typical DBMSs is inadequate for proper DB integrity. But also the "best" particular design/representation/encoding you choose depends on many other unspecified lhings including what sorts of queries you will be posing, set sizes, etc etc. So best to go with a straightforward design. Eg maintaining a unique id for every unique list you are representing as you input them. PS Eschew "basically", just clearly say what you mean once & give a name for shorthand future reference. – philipxy Dec 27 '19 at 10:48
  • @philipxy thanks for the comment; have trimmed the problem statement a bit. "*Eg maintaining a unique id for every unique list you are representing as you input them*" - Can you elaborate this further ? (in an answer maybe). As I understand, the unique id for me here is the `phrase_id`. Maybe I can store the list (as a separate column with UNIQUE constraint) in the `phrase` table (and autopopulate it using triggers ?). is that what you are implying ? – Madhur Bhaiya Dec 27 '19 at 11:26
  • Use the right abstraction(s) (abstract representation(s)) then map to implementation representations. (The essence of software design.) Phrases are lists. Just say they are lists. Right away. (Or strings mapped from lists--whatever.) Then talk about implementation representation in the DB. If they were some novel structure, you should give & name an appropriate abstract representation per specification needs not implementation needs. "Abstract" can still mean "relational DB" but not necessarily what your DBMS supports. – philipxy Dec 28 '19 at 00:30
  • So googling 'site:stackoverflow.com sql represent unique lists': [How to store a list in a column of a database table](https://stackoverflow.com/q/3070384/3404097) [What is the most correct way to store a “list” in a SQL Database?](https://stackoverflow.com/q/35493577/3404097) I don't know whether your question has a duplicate specifically re implemenation idioms for unique-constraining an abstract design that has a list-valued column. PS Beware--Rearrangements here are not all normalization in either of its senses of [to 1NF](https://stackoverflow.com/a/40640962/3404097) or higher NFs. – philipxy Dec 28 '19 at 00:41
  • What does `position` represent? Please provide an example. – Rick James Dec 28 '19 at 01:04
  • ... The current presentation is also somewhat backwards: It gives a table representation without saying what a phrase is & it then defines phrases in terms of tables. And "a phrase is single-space separated concatenation of the keywords, sorted by their position" is not clear--you are skimping on words. (But you should be mapping the other way.) Re lists being unique I meant in your query/view, as you mention here. – philipxy Dec 28 '19 at 01:07
  • @RickJames `position` or maybe `sort_order` is a positive integer. Keywords in a phrase will be positioned based on sorting of these values in ascending order. That makes sense ? So keyword with lowest position value, will come first in the phrase.. then a space character , then the keyword with second lowest position ..and so on.. – Madhur Bhaiya Dec 28 '19 at 04:07
  • @philipxy you are right that "phrase is a list/combination of keyword(s)". I just want this combination to be unique. So that no two phrases have exact same keyword list (in same order). Earlier, besides the mapping table, phrase table also stored the combination value and there was a unique constraint on it. But this I believe was duplication and one can change the mapping table without changing the stored list in the phrase table. Also, reason I gave so much detail and context in the question, as I thought maybe people have faced something like this before and have better tried approaches – Madhur Bhaiya Dec 28 '19 at 04:29
  • @MadhurBhaiya - Is the math problem of finding all _permutations_ of a set of words? – Rick James Dec 28 '19 at 04:46
  • No @RickJames . Sorry, I think my excessive problem statement has been confusing. So I have a list of relevant keywords, based on the content in my website. Unique permutations (not all permutations) of these keywords will make phrases, which are relevant for searching as well. For eg: `cotton jeans on sale` makes sense; but `jeans on cotton sale` does not. I am trying to store these unique permutations in the mapping table. Just unable to enforce the unique constraint in the table design itself. Hope that makes sense now ? – Madhur Bhaiya Dec 28 '19 at 05:07
  • @RickJames earlier I stored phrases only; but then we needed to rank/give weightage to certain keywords. Eg: `cotton` has priority than `cottonstone` . Also, certain keywords were synonyms/misspelt etc, so normalization required to have a separate master table of `keywords`. Now, `keywords` are mapped to `phrases` table, as described above. – Madhur Bhaiya Dec 28 '19 at 05:09

0 Answers0