MySQL GROUP BY not working correctly with umlaut characters?

Question

I noticed a problem with GROUP BY in a query I am currently trying to debug. I have a DB table with the following structure (reduced from actual real life):

CREATE TABLE IF NOT EXISTS `product_variants` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `product_id` int(11) unsigned NOT NULL DEFAULT '0',
  `pid_merchant` varchar(50) NOT NULL,
  `checksum` char(32) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `checksum` (`checksum`),
  KEY `product_id` (`product_id`),
) ENGINE=InnoDB  DEFAULT CHARSET=latin1;

In this table, I have the following 2 rows (among many other millions):

INSERT INTO `product_variants` (`id`, `product_id`, `pid_merchant`, `checksum`) VALUES
(525555236, 628702710, 'ARTüöäß111', 'af5334b1193bf171580c70813ac83327'),
(525555241, 628702710, 'ARTüöäß222', 'cfe50fd9c3ca29fd957b839892313f82');

The query I'm currently debugging is attempting to find duplicate entries in this table based on pid_merchant, in a very simple matter:

SELECT count(*), pv.* FROM product_variants pv WHERE pv.pid_merchant != '' GROUP BY pv.pid_merchant HAVING count(*) > 1

My problem is that both these results match, even though the actual pid_merchant values are different - one ends in 111, the other in 222. Does anyone know how to approach this issue? I already tried grouping by MD5() and HEX(), by changing collation to latin1_german2_ci, by forcing binary or utf8 conversion and many others - pretty much all I could think of.

Another weird thing is that it seems to confuse the values of Y and Ü (capital U with umlaute) while grouping (e.g. ABC-Y and ABC-Ü are considered as identical when grouping).

The server is running MySQL 5.5 on Ubuntu x64:

mysqld  Ver 5.5.29-0ubuntu0.12.04.2-log for debian-linux-gnu on x86_64 ((Ubuntu))

You are showing us two rows, where `pid_merchant` is clearly different and you are trying to find duplicates? — mogul, May 23 '13 at 14:51
The `pid_merchant` is clearly different, however they both match the GROUP BY, that was the point. However, switching to the ANSI syntax (as per @gbn) fixed the group by issue. — Elis, May 23 '13 at 14:57

score 1 · Accepted Answer · edited May 23 '17 at 12:31

1

This is not an umlaut (or accents generally) problem

It is how MySQL evaluates GROUP BY: it's non-standard and random. Standard SQL is this:

SELECT count(*),  pv.product_id, pv.pid_merchant
FROM product_variants pv
WHERE pv.pid_merchant != ''
GROUP BY pv.product_id, pv.pid_merchant
HAVING count(*) > 1

All non-aggregated columns should appear in the GROUP BY.

MySQL has "useful" MySQL extensions that remove this strict requirement. It happens often

edited May 23 '17 at 12:31

Community

1
1

answered May 23 '13 at 14:41

gbn

422,506
82
585
676

Thank you for the fast answer, it's way too easy to learn bad habits and fall from ANSI. Cheers! – Elis May 23 '13 at 14:51
You can make MySQL behave better with ONLY_FULL_GROUP_BY mode. See http://dev.mysql.com/doc/refman/5.5/en/group-by-extensions.html for more – gbn May 23 '13 at 15:03
Yes, I stumbled upon that bit of information during my research, but unfortunately that's not an option for this big mess I'm working with. Thanks again for the effort! – Elis May 23 '13 at 15:10

MySQL GROUP BY not working correctly with umlaut characters?

1 Answers1