
Suppose we have the following table:

+-----+------------+
| id  | categories |
+-----+------------+
| id1 | [20,25]    |
| id2 | [25]       |
| id3 | [20,25,28] |
| id4 | [28,25]    |
| id5 | [20,25]    |
+-----+------------+

The categories field is of JSON type. It contains only a known, limited set of integers, such as 20, 25, and 28. I need to count all occurrences of each of these values, in a manner like this:

+--------+-------+
| number | count |
+--------+-------+
|     20 |     3 |
|     25 |     5 |
|     28 |     2 |
+--------+-------+
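For reproducibility, a minimal version of this table can be set up like so (MySQL 5.7+ assumed; schema inferred from the samples above):

CREATE TABLE ml_categories (
  id VARCHAR(10) PRIMARY KEY,
  categories JSON
);

INSERT INTO ml_categories VALUES
('id1', '[20,25]'),
('id2', '[25]'),
('id3', '[20,25,28]'),
('id4', '[28,25]'),
('id5', '[20,25]');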

The main issue is to do this in a single query, without looping through the category numbers in server-side code or in a procedure call.

The head-on approach is the following:

SELECT 
    COUNT(id) AS 'count', '20' AS number
FROM
    ml_categories
WHERE
    JSON_CONTAINS(categories, '20') 
UNION SELECT 
    COUNT(id) AS 'count', '25' AS number
FROM
    ml_categories
WHERE
    JSON_CONTAINS(categories, '25') 
UNION SELECT 
    COUNT(id) AS 'count', '28' AS number
FROM
    ml_categories
WHERE
    JSON_CONTAINS(categories, '28')

But this solution has O(n) complexity, and the code itself isn't good enough. For example, scanning ~500K records takes about 1 second for one category on my hardware, so counting 10 categories takes about 10 seconds. Not good. Is there a way to optimize such a query?

Thanks in advance, guys

Anton

3 Answers


But this solution has O(n) complexity

I'm not sure what n is in your case. But I'm sure that you won't find any solution which scales better than O(n).

Assuming the following numbers:

  • n: Number of items (rows in 'ml_categories' table)
  • m: Number of all categories
  • a: Average number of categories per item

Your query has a complexity of O(n*m). Even Bill Karwin's solution (which I think is optimal) has a complexity of O(n*a) (assuming an index on category_id for the GROUP BY clause).
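To put numbers on that: with the question's n ≈ 500K rows, m = 10 categories, and a ≈ 2 categories per row (the average in the sample data), that is roughly 5M containment checks for the UNION approach versus about 1M unpacked rows.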

If you have a categories table containing all category ids, you could join against it. Such a table might look like this (a minimal sketch, since the question doesn't define one):
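CREATE TABLE categories (id INT PRIMARY KEY);
INSERT INTO categories VALUES (20), (25), (28);

Then the following query works: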

select c.id as category, count(*)
from ml_categories i
join categories c on json_contains(i.categories, cast(c.id as json))
group by c.id

It will return the same as your UNION query:

| category | count(*) |
| -------- | -------- |
| 20       | 3        |
| 25       | 5        |
| 28       | 2        |

View on DB Fiddle

And because no index can be used for the JOIN, it will probably be about as fast (or as slow) as your query (at best, an index can be used for the GROUP BY, avoiding a filesort).

If you use MySQL 8 (at least 8.0.4), you can utilize JSON_TABLE():

select c.category, count(*)
from ml_categories i
join json_table(
  i.categories,
  '$[*]' columns (category int path '$')
) c
group by c.category;

View on DB Fiddle

The JOIN with JSON_TABLE will "unpack" the categories from the JSON column into rows. If you remove the GROUP BY clause, you get a (dynamically) normalized table. This should scale with O(n*a). But since the table is created dynamically, there is no index to support the GROUP BY clause, so the result must be sorted first, which ends up with a complexity of O(n*a * log(n*a)). This scales better than O(n*m) (if m grows and a doesn't). But if m (the number of categories) is small enough, your query might still be the best you can do with the given schema.
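For illustration, here is the same query without the GROUP BY, showing the dynamically normalized rows (a sketch; output abbreviated):

select i.id, c.category
from ml_categories i
join json_table(
  i.categories,
  '$[*]' columns (category int path '$')
) c;

-- id1 | 20
-- id1 | 25
-- id2 | 25
-- id3 | 20
-- id3 | 25
-- id3 | 28
-- ...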

Paul Spiegel
  • Point taken, it's often a table-scan regardless, unless you restrict the values of category_id you're searching for. It might in theory be `O(a)` if the index on category_id includes a count of entries per distinct value of category_id. – Bill Karwin Dec 02 '19 at 21:01
  • 1
    Exactly.. You can't beat *O(n)* if you don't have a WHERE (or an equivalent) condition. But if you want to get all items for a specific category, the normalized schema is superior (*O(log(n))* vs *O(n)*). I though can't think of a problem, which can be solved with *O(a)*. If you want the result to contain all categories, you can't get it better than *O(m)* (e.g.: read the counts from a "cache" table). – Paul Spiegel Dec 02 '19 at 21:30
  • 1
    Regardless, I get the impression the OP is looking for a query that is simpler to code, in addition to having the best computational complexity. – Bill Karwin Dec 02 '19 at 22:27
  • Great! That's what I was looking for. I also came to using `cast` for switching types to build such a query. A small performance gain can also be measured. Stack Overflow is great, as always :) – Anton Dec 03 '19 at 10:05

Storing comma-separated lists is not a relational strategy. It is denormalized. Denormalization, like all optimizations, optimizes for one type of query at the expense of other types of queries.

So it's no surprise that you have trouble optimizing the other types of queries.

The way to optimize this query is to avoid storing multi-valued attributes in comma-separated lists (or JSON arrays, which are logically equivalent). Instead, store them in rows, one value per row. In other words, normalize your data.

Create a table for the many-to-many relationship between your 'things' (whatever they are) and categories:

CREATE TABLE things_have_categories (
  thing_id VARCHAR(10), 
  category_id INT, 
  PRIMARY KEY (category_id, thing_id)
);
INSERT INTO things_have_categories VALUES
('id1', 20),
('id1', 25),
('id2', 25),
('id3', 20),
('id3', 25),
('id3', 28),
('id4', 28),
('id4', 25),
('id5', 20),
('id5', 25);
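If the data already lives in the ml_categories JSON column, these rows could be derived in one pass with JSON_TABLE instead of being typed by hand (a sketch, assuming MySQL 8.0.4+):

INSERT INTO things_have_categories (thing_id, category_id)
SELECT i.id, c.category
FROM ml_categories i
JOIN JSON_TABLE(
  i.categories,
  '$[*]' COLUMNS (category INT PATH '$')
) c;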

Then you can write a simpler and optimized query like this:

SELECT category_id, COUNT(*) as count
FROM things_have_categories
GROUP BY category_id

Output:

+-------------+-------+
| category_id | count |
+-------------+-------+
|          20 |     3 |
|          25 |     5 |
|          28 |     2 |
+-------------+-------+
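Because the primary key leads with category_id, this GROUP BY can scan the index in order, with no filesort.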

You may also like my answer to "Is storing a delimited list in a database column really that bad?"

You might reply, "but I can't change the way this table is stored."

I've heard that before. If that's the constraint, then you can't optimize the query. It is bound to be O(n).

Bill Karwin
  • Thanks. I know your solution is more viable and better by design, but using the JSON type was a kind of experiment, performance issues included. That's why the question seemed a little weird even to me, knowing the table itself should be designed better. – Anton Dec 03 '19 at 09:59
  • Well, the result of your experiment is: JSON makes it _easier_ to insert complex data (in this case, a list of category id's for each thing), but _harder_ to query that data. – Bill Karwin Dec 03 '19 at 15:33

You can use the JSON_EXTRACT() function to pull out each of the three array components, count each value (20, 25, and 28) with conditional aggregation, and then combine the per-value queries with UNION ALL. Note that this assumes each array has at most three elements:

SELECT 20 as number, sum(case when 20 in (comp1,comp2,comp3) then 1 else 0 end) as count
  FROM
  (SELECT JSON_EXTRACT(categories, '$[0]') as comp1, 
          JSON_EXTRACT(categories, '$[1]') as comp2,
          JSON_EXTRACT(categories, '$[2]') as comp3
     FROM ml_categories ) q1 
UNION ALL
SELECT 25 as number, sum(case when 25 in (comp1,comp2,comp3) then 1 else 0 end) as count
  FROM
  (SELECT JSON_EXTRACT(categories, '$[0]') as comp1, 
          JSON_EXTRACT(categories, '$[1]') as comp2,
          JSON_EXTRACT(categories, '$[2]') as comp3
     FROM ml_categories ) q2
UNION ALL
SELECT 28 as number, sum(case when 28 in (comp1,comp2,comp3) then 1 else 0 end) as count
  FROM
  (SELECT JSON_EXTRACT(categories, '$[0]') as comp1, 
          JSON_EXTRACT(categories, '$[1]') as comp2,
          JSON_EXTRACT(categories, '$[2]') as comp3
     FROM ml_categories ) q3 

Demo

Barbaros Özhan