1

I'm a beginner in MySQL. I tried to write a query for finding the max N members per group. The following code does the job, which in this case is to get the top 2 persons per country with greatest id:

SELECT co.id, co.person, co.country
FROM person co
WHERE (
SELECT COUNT(*)
FROM person ci
WHERE  co.country = ci.country      -- controlling grouping column
AND co.id < ci.id                       -- controlling min or max 
) < 2                                           -- controlling number of return per group
;

However, I was unable to understand why this is working. Could someone please explain to me? It would be great if you can tell me:

  1. what is the result returned by the inner SELECT COUNT(*)? Isn't it just a single number?
  2. why use this particular inner WHERE ... AND ... clause?
  3. why the outer WHERE () < 2 controls the number of return per group?

Thank you so much for your help.

DDLs

CREATE TABLE `person` (
    `id` INT(11) NULL DEFAULT NULL,
    `country` VARCHAR(50) NULL DEFAULT NULL COLLATE 'utf8_bin',
    `person` VARCHAR(50) NULL DEFAULT NULL COLLATE 'utf8_bin'
)
COLLATE='utf8_bin'
ENGINE=InnoDB
;

INSERT INTO `person` (`id`, `country`, `person`) VALUES
 (1, 'Austria', 'Sue'),
(2, 'Austria', 'Anie'),
(3, 'Australia', 'John'),
(4, 'Australia', 'Brian'),
(5, 'UK', 'Jim'),
(6, 'UK', 'Tim'),
(7, 'USA', 'David'),
(8, 'USA', 'Mike'),
(9, 'USA', 'Tom'),
(10, 'N. Korea', 'Joe'),
(11, 'N. Korea', 'Hue'),
(12, 'N. Korea', 'Rick'),
(13, 'N. Korea', 'Jamy'),
(14, 'Finland', 'Kimi');
Strawberry
  • 33,750
  • 13
  • 40
  • 57
Smile
  • 113
  • 1
  • 7
  • 2
    This is correlated subquery. It is executed for each row from "main" table `co`, and the values for according columns are set as parameters in the subquery. – Akina Nov 10 '20 at 08:27
  • @Akina Thank you for pointing to me the concept correlated subquery. – Smile Nov 10 '20 at 08:47
  • utf8_bin should be utf8mb4_bin. Always use utf8mb4, not the old utf8 – ysth Nov 10 '20 at 08:48
  • Binary collation is strange for textual data in shown case.. – Akina Nov 10 '20 at 08:49

2 Answers2

0

The correlated subquery returns the number of persons with the same country and a higher id. So if you compare that with < 2 you will only select the two highest ids for each country. You may get a better idea of this by looking at the results of selecting the subquery instead of using it as a constraint:

SELECT co.id, co.person, co.country,
    (
        SELECT COUNT(*)
        FROM person ci
        WHERE  co.country = ci.country      -- controlling grouping column
        AND co.id < ci.id                       -- controlling min or max 
    ) AS higher_ids
FROM person co

https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=9c3cfe469dd299b3160d09e97e73e499

Note that unless there are optimizations that specifically are looking for this pattern, which I highly doubt, this query is going to be O(N^2).

ysth
  • 96,171
  • 6
  • 121
  • 214
  • Thanks. It looks like if we want max N attr considering repetitions, we need change the inner select to `SELECT COUNT(DISTINCT ci.attr)`. – Smile Nov 11 '20 at 03:20
  • Not sure what you mean by considering repetitions. I wouldn't consider this a good starting place for top N of any kind, unless you have very few records. I suggest that now your question about how this query works is answered, you ask a new question showing the problem you are actually trying to solve and giving your schema, what you've tried, sample data, and expected results for the sample data – ysth Nov 11 '20 at 03:32
  • thank you for your feedback! I've asked a more clearly formulated question about what I want to accomplish: https://stackoverflow.com/questions/64780798/mysql-efficient-max-n-values-per-group. – Smile Nov 11 '20 at 05:12
  • if this answer helped you, please consider upvoting it – ysth Nov 12 '20 at 18:08
0

This is a query that contains a subquery in where condition.

  1. The SELECT COUNT(*) return number only.
  2. The WHERE ... AND ... is used to connect the subquery table with the main query table via column "country"
  3. The statement WHERE (...) on the main query is to return the index of each record on the group

Here is other query that returns the same result, you can run subquery that independence with main query

SELECT *
FROM (
    SELECT co.id, co.person, co.country, 
    (
        SELECT COUNT(*)
        FROM person ci
        WHERE  co.country = ci.country AND co.id < ci.id                       
    ) AS 'sort_index' -- low index mean the max id
    FROM person co
) `p`
WHERE p.sort_index < 2
Dat Tran
  • 14
  • 1