
First things first: what I am doing works fine. I'm just seeing if there is any room for improvement, and whether how I'm doing things is standard and/or good practice.

These are the tables in question:

  • item
  • topic
  • item_topic
  • item_like_audit

This is my use case:

  • There are topics that can contain many items.
  • Each item can have any number of likes on it.
  • For each like, a record is stored in the item_like_audit table so that it can be queried at a later time for ranking purposes.
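
For context, the write path looks roughly like this (a Python/sqlite3 sketch purely to illustrate; `record_like` is a hypothetical helper, and the real code runs against MySQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY, like_count INT NOT NULL DEFAULT 0);
CREATE TABLE item_like_audit (item_id INT NOT NULL, user_id INT NOT NULL,
                              created_at TEXT DEFAULT CURRENT_TIMESTAMP);
INSERT INTO item (id) VALUES (1);
""")

def record_like(item_id, user_id):
    # One transaction: append the audit row and bump the denormalized counter
    with con:
        con.execute("INSERT INTO item_like_audit (item_id, user_id) VALUES (?, ?)",
                    (item_id, user_id))
        con.execute("UPDATE item SET like_count = like_count + 1 WHERE id = ?",
                    (item_id,))

record_like(1, 100)
record_like(1, 101)

count = con.execute("SELECT like_count FROM item WHERE id = 1").fetchone()[0]
audit_rows = con.execute("SELECT COUNT(*) FROM item_like_audit").fetchone()[0]
print(count, audit_rows)  # both 2: the counter stays in step with the audit trail
```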

This is what the query is trying to achieve:

  • Get all items under a certain topic that received the most likes within the past 7 days.

Can the following query or underlying schema be improved in any way (for performance or memory gains)?

Query:

SELECT DISTINCT item.* FROM item

/* Match items under this specific topic */
JOIN topic
    ON topic.slug = ?
    AND topic.deleted_at IS NULL
JOIN item_topic
    ON item_topic.item_id = item.id
    AND item_topic.topic_id = topic.id
    AND item_topic.deleted_at IS NULL

/* Match items that have had "like" activity in the past 7 days */
JOIN item_like_audit
    ON item_like_audit.item_id = item.id
    AND item_like_audit.created_at <= (CURRENT_DATE + INTERVAL 7 DAY)
WHERE item.deleted_at IS NULL

/* Order by highest like count to lowest */
ORDER BY item.like_count DESC

/* Pagination */
LIMIT ? OFFSET ?

Schema:

CREATE TABLE item (
    id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,

    name VARCHAR(255) NOT NULL,
    slug VARCHAR(255) NOT NULL UNIQUE,
    tagline VARCHAR(255) NOT NULL,
    description VARCHAR(1000) NOT NULL,
    price FLOAT NOT NULL,
    like_count INT(10) NOT NULL DEFAULT 0,
    images VARCHAR(1000) NOT NULL,

    created_at TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
    deleted_at TIMESTAMP NULL DEFAULT NULL,

    PRIMARY KEY (id)
);

CREATE TABLE item_like_audit (
    id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,

    item_id INT(10) UNSIGNED NOT NULL,
    user_id INT(10) UNSIGNED NOT NULL,

    created_at TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,

    PRIMARY KEY (id),
    KEY `item_like_audit_created_at_index` (`created_at`)
);

CREATE TABLE topic (
    id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,

    name VARCHAR(255) NOT NULL,
    slug VARCHAR(255) NOT NULL UNIQUE,

    created_at TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
    deleted_at TIMESTAMP NULL DEFAULT NULL,

    PRIMARY KEY (id)
);

CREATE TABLE item_topic (
    id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,

    item_id INT(10) NOT NULL,
    topic_id INT(10) NOT NULL,

    created_at TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
    deleted_at TIMESTAMP NULL DEFAULT NULL,

    PRIMARY KEY (id)
);
Lansana Camara

  • NOTE: `created_at <= (CURRENT_DATE + INTERVAL 7 DAY)` doesn't look like "within past seven days".. seems like we'd want `created_at >= DATE(NOW()) + INTERVAL -7 DAY` – spencer7593 Nov 08 '17 at 16:31
  • @spencer7593 Great catch! Thanks for that. I was getting a similar result set since the `created_at` field was set to today for everything in my seed data. – Lansana Camara Nov 08 '17 at 16:34

2 Answers


Since you are only returning Item records, you could try this for possible improved performance:

select Item.* 
  from Item
 where Item.deleted_at is null
   and exists (select 1 from item_topic
                where item_topic.item_id = item.id
                  and item_topic.deleted_at is null
                  and exists (select 1 from topic
                               where topic.id = item_topic.topic_id
                                 and topic.deleted_at is null
                                 and topic.slug = ?))
   and exists (select 1 from item_like_audit
                where item_like_audit.item_id = item.id
                  and item_like_audit.created_at >= (current_date - interval 7 day))
 order by Item.like_count desc

This can potentially improve performance since:

  • You don't need the DISTINCT operator
  • The database only has to find one row in each supporting table that matches the constraints, instead of all matching records.
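
The semantics are easy to check with a small sketch (SQLite via Python's stdlib here, purely to illustrate; MySQL's planner differs, but the row semantics are the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY, like_count INT);
CREATE TABLE item_like_audit (item_id INT, created_at TEXT);
INSERT INTO item VALUES (1, 5), (2, 3);
-- item 1 has two recent likes, item 2 has one
INSERT INTO item_like_audit VALUES
    (1, '2017-11-07'), (1, '2017-11-08'), (2, '2017-11-08');
""")

# A plain JOIN repeats item 1 (one row per audit record), forcing a DISTINCT
via_join = con.execute("""
    SELECT item.id FROM item
    JOIN item_like_audit ON item_like_audit.item_id = item.id
""").fetchall()

# EXISTS is satisfied by the first matching audit row, so each item appears once
via_exists = con.execute("""
    SELECT item.id FROM item
    WHERE EXISTS (SELECT 1 FROM item_like_audit
                  WHERE item_like_audit.item_id = item.id)
    ORDER BY item.like_count DESC
""").fetchall()

print(via_join)    # item 1 appears twice
print(via_exists)  # each item once, ordered by like_count
```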
Sentinel
  • In this case, I actually want all of the results based on the `LIMIT` and `OFFSET`. So if I have `LIMIT 10` , I want 10 results, not just one. The result set is essentially a "top ranked list", as far as how it will be displayed in the UI. – Lansana Camara Nov 08 '17 at 16:56
  • In the second bullet point, I'm not indicating that you will only receive 1 row from table Items, but that Tables Item_Like_Audit, Item_Topic, and Topic need only be searched until 1 matching row is found for the constraints relative to the correlated record from table Item. An existence test returns true if at least one record in the subquery is returned. The subquery could logically return more than one record, but the database only needs to find one record to prove the existence test as true. – Sentinel Nov 08 '17 at 17:05
  • There are no joins in my query. It only returns records from table Item. Since Item.ID is the unique primary key, all returned records will be unique, non duplicate, and hence the DISTINCT operator is not needed. – Sentinel Nov 08 '17 at 17:06
  • @Sentinel; i missed that you were doing EXISTS on item_topic, and not a join. (For performance, we are definitely going to want suitable indexes for those correlated subqueries). – spencer7593 Nov 08 '17 at 17:06
  • Ah I see, thanks for the explanation. And as for the two columns being unique, in fact I may need to make them unique now because a user should only be allowed to like a specific item once. – Lansana Camara Nov 08 '17 at 17:07
  • You can further improve `item_topic` by getting rid of `id` and promoting the `UNIQUE(item, topic)` to `PRIMARY`. See http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table for that and more suggestions. – Rick James Nov 10 '17 at 17:12

Assuming item_topic(item_id,topic_id) is unique, we could do away with the "Using filesort" operation by getting rid of the DISTINCT keyword, and rewriting the check of item_like_audit as an EXISTS correlated subquery instead of a JOIN operation.

We'd have a guarantee of the uniqueness if we had

  CREATE UNIQUE INDEX item_topic_UX1 ON item_topic (topic_id, item_id);

We already have guarantees of uniqueness for topic(slug), topic(id), item(id), ...

  SELECT item.* 
    FROM item

/* Match items under this specific topic */
    JOIN item_topic
      ON item_topic.item_id = item.id
     AND item_topic.deleted_at IS NULL
    JOIN topic
      ON topic.id    = item_topic.topic_id
     AND topic.slug  = ?
     AND topic.deleted_at IS NULL

   WHERE item.deleted_at IS NULL
/* Match items that have had "like" activity in the past 7 days */
     AND EXISTS ( SELECT 1
                    FROM item_like_audit
                   WHERE item_like_audit.item_id = item.id
                     AND item_like_audit.created_at >= DATE(NOW()) + INTERVAL -7 DAY
                 )

/* Order by highest like count to lowest */
  ORDER BY item.like_count DESC

For improved performance of the correlated subquery, we could create a covering index

  CREATE INDEX item_like_audit_IX1 ON item_like_audit (item_id, created_at)

We expect the unique index we created earlier to be used for the join operation, so that should also improve performance. We could get a covering index by including the deleted_at column:

  CREATE INDEX item_topic_IX2 ON item_topic (topic_id, item_id, deleted_at)

That is redundant with the unique index we created earlier; if we still want to guarantee uniqueness, flip the order of the columns around:

  DROP INDEX item_topic_UX1 ON item_topic ;
  CREATE UNIQUE INDEX item_topic_UX1 ON item_topic (item_id,topic_id);

If we don't have guaranteed uniqueness, then I would favor adding a GROUP BY item.id clause over a DISTINCT keyword.


Use EXPLAIN to see the execution plan, and verify that appropriate indexes are being used.
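
As an illustration of that check (using SQLite's EXPLAIN QUERY PLAN from Python's stdlib, since it is easy to script; MySQL's EXPLAIN output looks different but answers the same question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item_like_audit (
    id INTEGER PRIMARY KEY,
    item_id INT NOT NULL,
    created_at TEXT
);
CREATE INDEX item_like_audit_IX1 ON item_like_audit (item_id, created_at);
""")

# The correlated subquery's WHERE clause, with sample values bound
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT 1 FROM item_like_audit
    WHERE item_id = 42 AND created_at >= '2017-11-01'
""").fetchall()

# The last column of each plan row is a human-readable description
detail = " ".join(row[-1] for row in plan)
print(detail)  # should name item_like_audit_IX1 (a covering index here)
```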


If we can't guarantee uniqueness of (item_id, topic_id) in item_topic, and the overhead of the "Using filesort" operation for the GROUP BY is still too high, we could try checking the "matching topic" condition using an EXISTS. (But I don't hold out much hope that this will be any faster.)

  SELECT item.*
    FROM item
   WHERE item.deleted_at IS NULL
     AND EXISTS ( SELECT 1
                    FROM topic
                    JOIN item_topic
                      ON item_topic.item_id    = item.id
                     AND item_topic.topic_id   = topic.id
                     AND item_topic.deleted_at IS NULL
                    JOIN item_like_audit 
                      ON item_like_audit.item_id = item.id
                     AND item_like_audit.created_at >= DATE(NOW()) + INTERVAL -7 DAY 
                   WHERE topic.slug  = ?
                     AND topic.deleted_at IS NULL
                )
  ORDER BY item.like_count DESC

We are going to need to have suitable indexes available for performance of the correlated subquery.

spencer7593
  • The `CREATE INDEX` statements are recommendations for *schema* changes. – spencer7593 Nov 08 '17 at 17:01
  • So is it generally a good practice to create indexes on join tables where I am keeping id's of various tables for relationship purposes? Currently I have no indexes anywhere. Also, why does it matter what order the columns are when creating the index? – Lansana Camara Nov 08 '17 at 17:20
  • We add indexes for basically two reasons: 1) to enforce uniqueness and 2) to improve performance and reduce contention. There are indexes defined on the tables. The PRIMARY KEY of a table is an index. Also, a unique index is created on topic(slug) because of the UNIQUE keyword. (Run a `SHOW CREATE TABLE topic`, and you'll see that there is a UNIQUE INDEX declared.) The order of the columns in an index matters in terms of how the optimizer can make use of the index. An index on foo(a,b) isn't effective for range scan `SELECT b FROM foo WHERE b = ?`, because b is not a leading column. – spencer7593 Nov 08 '17 at 17:30
  • I see, so in your example you arrange the index `item_topic_IX2` as `(topic_id, item_id, deleted_at)`, however the join query using those 3 columns uses `item_id` first, then `topic_id`, then `deleted_at`. Should the index arrangement be in that order instead, or am I misunderstanding? – Lansana Camara Nov 08 '17 at 17:33
  • ... in the case of a query `SELECT a,b FROM foo WHERE a=? AND b=?`, that index would be fine, as would an index on foo(b,a). But we don't need redundant indexes, and sometimes it comes down to cardinality, how frequently values occur/repeat, and how selective the query is. If you have a question about why we would choose an index on foo(a,b) over an index on foo(b,a), feel free to search Stack Overflow, and if you don't find an answer, ask another question. – spencer7593 Nov 08 '17 at 17:34
  • The order of the conditions in an ON clause or WHERE shouldn't make much difference. A predicate is a predicate is a predicate. What matters is that there's an equality condition on slug, which gets us a single value of topic_id. That's why I want topic_id as the leading column in the index. I want both item_id and created_at in the index, so that checks can be performed against index pages, without lookups to the underlying data pages. We could experiment with different orders for the last two columns, but I suspect having item_id before created_at would make the index suitable for other queries. – spencer7593 Nov 08 '17 at 17:38
  • Thanks for the elaborate help. Last point, I suppose I should be using subqueries instead of JOIN's, because I am using the JOIN's in a "get things from A, conditional on things from B" kind of way, as opposed to merging the data together. Basing off of this: https://stackoverflow.com/questions/2577174/join-vs-sub-query – Lansana Camara Nov 08 '17 at 17:41
  • Typically JOIN operations give the most efficient execution plans, given suitable indexes. Correlated subqueries can sometimes be optimized into "join performance", but when the query plan does an execution of the subquery for every row in the outer select, that can get expensive. It will eat your lunch. And your lunchbox too. I dropped to using an EXISTS (subquery) to avoid introducing duplicate rows into the result set that will need to be later eliminated with a "Using filesort" operation. We should test both ways, with realistic data volume/distribution, to determine which is better. – spencer7593 Nov 08 '17 at 17:43
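
The leading-column rule discussed in the comments above can be sketched like this (SQLite via Python's stdlib, purely to illustrate; `foo`, `a`, and `b` are the hypothetical names from the comments, and MySQL behaves analogously):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE foo (a INT, b INT);
CREATE INDEX foo_ab ON foo (a, b);
INSERT INTO foo VALUES (1, 10), (2, 20);
""")

def plan(sql):
    # Join the human-readable plan details into one string
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Filtering on the leading column a lets the planner seek into foo_ab (SEARCH)...
with_leading = plan("SELECT b FROM foo WHERE a = 1")

# ...but filtering only on b cannot seek on foo_ab, since b is not the
# leading column; the planner falls back to a full scan (SCAN)
without_leading = plan("SELECT a FROM foo WHERE b = 10")

print(with_leading)
print(without_leading)
```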