1

I have a very long list of users. I want to group them by first letter of name. If the first letter is not a letter, it's grouped under #, so I have max 27 groups, for a-z + #.

I want to show only the gorup labels (e.g. F) if it will have results, and for every letter I want to know how many results it will have. So I do a single GROUP query to count all groups:

SELECT
  IF(lastname REGEXP '^[a-z]', UPPER(SUBSTRING(lastname, 1, 1)), '#') first_char,
  COUNT(1) num_users
GROUP BY first_char

That seems to work, BUT using REGEXP means that Ö isn't an O, but a #. That's a problem, because LIKE does find 'Ö' = 'O', so it will be in the O group when I name LIKE 'O%'. I could use REGEXP in the results query too, but I rather file Ö under O.

So the LIKE query works perfectly, but the GROUP query doesn't. How do I do exactly what LIKE does during comparisons?, so the group numbers and results always perfectly match.

Or another way to count correctly?

edit 1

Using LIKE a OR LIKE b OR .. OR LIKE z in the IF doesn't even work, because then the group might be Ö instead of O. The numbers will be correct, but the group label won't be. I really need a conversion...

edit 2

Thanks to @mpen. lastname REGEXP '^[[:alpha:]]' is shorter than 26 LIKEs, but the Ö label problem remains. Converting that outside MySQL is easy though.

Rudie
  • 52,220
  • 42
  • 131
  • 173
  • 2
    Instead of `'^[a-z]'` can you try `[:alpha:]`? That should include accented characters I think, and then GROUP BY will *hopefully* group the same way that LIKE does (but I'm not sure). You may also need to `COLLATE` that first char into something like `utf8_unicode_ci` so it does the grouping properly. – mpen Mar 09 '18 at 01:24
  • That works!, BUT it's not perfect, see **edit 1**. Column is `utf8_unicode_ci` so no collate necessary. Grouping works. – Rudie Mar 09 '18 at 01:27
  • You can do something like [this](https://stackoverflow.com/a/6945191/65387) but you're probably better off 'deburring' the letters in whatever scripting language you're using. – mpen Mar 09 '18 at 01:28

1 Answers1

2

You can do the grouping like this:

select
    IF(name REGEXP '^[[:alpha:]]', UPPER(SUBSTRING(name, 1, 1)), '#') first_char,
    COUNT(1) num_users
from _grouptest
group by first_char

And then remove the accents in your scripting language of choice, or if you're brave, you can attempt to remove them in pure MySQL.

mpen
  • 272,448
  • 266
  • 850
  • 1,236
  • `[:alpha:]` uses the same dark hardcoded conversion magic as `LIKE`. Why don't they expose that? =( I read somewhere how they do that, in greater detail than in [the LIKE docs](https://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like), but I can't find it anymore. I'm definitely not brave enough for manually transliterating. – Rudie Mar 09 '18 at 01:37
  • @Rudie Yeah, I'm not sure why. Maybe partly because each codepoint is assigned a number or weight, rather than being translated back to an unaccented character. i.e., they don't actually have a table that does this mapping? At least I think that's how it works. – mpen Mar 09 '18 at 01:40
  • 1
    They have a table for mapping `O` and `Ö` to the same marker/number/weight, but not for mapping that marker to `O`. I guess that makes sense. `O` is not more correct or better than `Ö` so they wouldn't. But I would! `[:alpha:]` saves the day! Many thanks. – Rudie Mar 09 '18 at 01:43
  • Yeah, that's what I mean. You said it better :-) – mpen Mar 09 '18 at 01:44