Transilerate accents in MySql like LIKE

Question

I have a very long list of users. I want to group them by first letter of name. If the first letter is not a letter, it's grouped under #, so I have max 27 groups, for a-z + #.

I want to show only the gorup labels (e.g. F) if it will have results, and for every letter I want to know how many results it will have. So I do a single GROUP query to count all groups:

SELECT
  IF(lastname REGEXP '^[a-z]', UPPER(SUBSTRING(lastname, 1, 1)), '#') first_char,
  COUNT(1) num_users
GROUP BY first_char

That seems to work, BUT using REGEXP means that Ö isn't an O, but a #. That's a problem, because LIKE does find 'Ö' = 'O', so it will be in the O group when I name LIKE 'O%'. I could use REGEXP in the results query too, but I rather file Ö under O.

So the LIKE query works perfectly, but the GROUP query doesn't. How do I do exactly what LIKE does during comparisons?, so the group numbers and results always perfectly match.

Or another way to count correctly?

edit 1

Using LIKE a OR LIKE b OR .. OR LIKE z in the IF doesn't even work, because then the group might be Ö instead of O. The numbers will be correct, but the group label won't be. I really need a conversion...

edit 2

Thanks to @mpen. lastname REGEXP '^[[:alpha:]]' is shorter than 26 LIKEs, but the Ö label problem remains. Converting that outside MySQL is easy though.

Instead of `'^[a-z]'` can you try `[:alpha:]`? That should include accented characters I think, and then GROUP BY will *hopefully* group the same way that LIKE does (but I'm not sure). You may also need to `COLLATE` that first char into something like `utf8_unicode_ci` so it does the grouping properly. — mpen, Mar 09 '18 at 01:24
That works!, BUT it's not perfect, see **edit 1**. Column is `utf8_unicode_ci` so no collate necessary. Grouping works. — Rudie, Mar 09 '18 at 01:27
You can do something like [this](https://stackoverflow.com/a/6945191/65387) but you're probably better off 'deburring' the letters in whatever scripting language you're using. — mpen, Mar 09 '18 at 01:28

mpen · Accepted Answer · 2018-03-09T01:40:40.850

2

You can do the grouping like this:

select
    IF(name REGEXP '^[[:alpha:]]', UPPER(SUBSTRING(name, 1, 1)), '#') first_char,
    COUNT(1) num_users
from _grouptest
group by first_char

And then remove the accents in your scripting language of choice, or if you're brave, you can attempt to remove them in pure MySQL.

_.deburr in JS
Str::removeDiacritics from my PHP lib ptilz which was yoinked from WordPress

edited Mar 09 '18 at 01:40

answered Mar 09 '18 at 01:33

mpen

272,448
266
850
1,236

`[:alpha:]` uses the same dark hardcoded conversion magic as `LIKE`. Why don't they expose that? =( I read somewhere how they do that, in greater detail than in [the LIKE docs](https://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like), but I can't find it anymore. I'm definitely not brave enough for manually transliterating. – Rudie Mar 09 '18 at 01:37
@Rudie Yeah, I'm not sure why. Maybe partly because each codepoint is assigned a number or weight, rather than being translated back to an unaccented character. i.e., they don't actually have a table that does this mapping? At least I think that's how it works. – mpen Mar 09 '18 at 01:40
1

They have a table for mapping `O` and `Ö` to the same marker/number/weight, but not for mapping that marker to `O`. I guess that makes sense. `O` is not more correct or better than `Ö` so they wouldn't. But I would! `[:alpha:]` saves the day! Many thanks. – Rudie Mar 09 '18 at 01:43
Yeah, that's what I mean. You said it better :-) – mpen Mar 09 '18 at 01:44

Transilerate accents in MySql like LIKE

1 Answers1