3

Starting from this table with a utf8mb4_unicode_ci collation:

DROP TABLE IF EXISTS preferences;
CREATE TABLE IF NOT EXISTS preferences
(
    id      int auto_increment primary key,
    user_id int          not null,
    title   varchar(150) not null
) collate = utf8mb4_unicode_ci;

using these values:

INSERT INTO preferences (user_id, title) VALUES (1, 'Preference 1'), (1, 'Preference 2'), (1, 'Preference 3'), (1, 'Preference 4'), (1, 'Preference 5');

This query

SELECT CONCAT('"', GROUP_CONCAT(p.title SEPARATOR '","'), '"') AS title
FROM preferences AS p
GROUP BY p.user_id

with a group_concat_max_len=64 results in

title
"Preference 1","Preference 2","Preference 3","Preference 4","Pref"

But when using that query in a derived table, like so:

SELECT preferences_grouped.title
FROM (
    SELECT CONCAT('"', GROUP_CONCAT(p.title SEPARATOR '","'), '"') AS title
    FROM preferences AS p
    GROUP BY p.user_id
) as preferences_grouped

you'll get only 16 chars (64/4)

title
"Preference 1","Pr

This is not happening when the concat is removed or moved to the primary query:

SELECT CONCAT('"', preferences_grouped.title, '"')
FROM (
    SELECT GROUP_CONCAT(p.title SEPARATOR '","') AS title
    FROM preferences AS p
    GROUP BY p.user_id
) as preferences_grouped;
CONCAT('"', preferences_grouped.title, '"')
"Preference 1","Preference 2","Preference 3","Preference 4","Pref"

So it seems that adding the concat in combination with a derived table ànd a utf8mb4 charset causes the byte count for the chars in the result set to be quadrupled, meaning my group_concat_max_len should really be quadrupled too.

  1. Why is that?
  2. Can I do something using COLLATE to fix this / change this behavior?

Note: When the preferences table has a non-utf8mb4 charset, like the default latin1_swedish_ci collation, then there's also no byte-quadrupling going on.

Check the fiddle here: https://www.db-fiddle.com/f/hQD6r3pE5pmgR3n2XA7VtX/5

Shadow
  • 33,525
  • 10
  • 51
  • 64
Sneakyvv
  • 136
  • 5
  • Collations are used for ordering strings only. The character set influences how many bytes a character can take up. – Shadow Apr 09 '21 at 08:25
  • Ok but collations are a particular version of a charset, right? Anyhow, the question is mostly about why MySQL is acting (so strangly imo) like this, with the `concat(group_concat())` and derived table combination. Since the innerquery is giving a different result set when run separately. – Sneakyvv Apr 09 '21 at 10:11
  • No collations are not a version of character sets. Collations tell mysql how to order characters in relation to each other. A character set will have multiple different collations. – Shadow Apr 09 '21 at 10:33
  • To me the question is the other way around: how come group_concat does not use 4 bytes / character for utf8mb4 characters in the first instance – Shadow Apr 09 '21 at 10:38
  • >"_utf8mb4 uses a maximum of four bytes per character_" See [here](https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-conversion.html#:~:text=utf8mb4%20uses%20a%20maximum%20of%20four%20bytes%20per%20character). So, a simple 'a' would be one byte, but when using the combination in question, then suddenly _every_ char takes up 4 bytes :shrug: – Sneakyvv Apr 09 '21 at 11:17
  • That's the actual length, these buffers usually work with an assumption of max byte lengths per character. Same as max index length issues. I could not find anything useful in the mysql documentation. May be worthwhile to raise it as a bug on mysql's bug reporting website and see what they come back with. The behaviour definitely seems inconsistent. – Shadow Apr 09 '21 at 11:45
  • Ok, I was hesitating to raise a bug ticket, and wanted to check with some other people first, but I'll guess I'll have to do that then, thanks! – Sneakyvv Apr 09 '21 at 12:12
  • Pls do share the response you get. – Shadow Apr 09 '21 at 12:16
  • The [bug has been verified](https://bugs.mysql.com/bug.php?id=103279). Now we wait... :-) – Sneakyvv Apr 12 '21 at 13:35
  • Nice one! The categorisation is non-critical and v8 is unaffected, so probably not too high up on the bugfix priority list... – Shadow Apr 12 '21 at 16:58

1 Answers1

1

It's a bug in MySQL 5.x, and it has been reported as such.

Sneakyvv
  • 136
  • 5