2

I have a table of character names in a mySQL database.

I am trying to query the table and sort them alphabetically by name.

Some of the characters have names like "The Dagda" and the "The " needs to be ignored so I am attempting to use:

select character_id, name from characters where is_del=0 order by trim('The ' from name)

Which seems to work...

Some of the other characters have UTF-8 characters in their names such as "Ériu"

However now when my table is returned I get these "É" entries listed between "A" & "B".

I.E.:

Aengus Amergin Ériu Balor Banba etc.

Preservation of these UTF characters is crucially important on the front end.

Does anyone know a method where I could have these "É" characters and similar be represented as "E" for purposes of sorting, but will still render in the dataset as what they actually are?

I am thinking before asking this that this may not be possible but I am hoping someone here might have run into a similar problem before and might have a workaround.

Thanks in advance.

EDIT: changed UTF-16 to UTF-8 (my bad)

EDIT @Rick James :

I could not format this readably in a comment but the hex of the query is as follows:

name | hex(name)

Aengus Óg | 41656E67757320C383E2809C67
Amergin | 416D657267696E
Ériu | C383E280B0726975
Balor | 42616C6F72
Banba | 42616E6261

The 3rd item down is Ériu - I am not sure why they are rendering as above but this is what is being displayed through the phpmyadmin interface when I run the query select character_id, name, hex(name) from characters order by trim('The ' from name)

The first character's full name should be Aengus Óg (I am assuming this is again down to character set or collation but I am unsure so apologies for the ignorance on my part here)

  • 1
    Sounds like you need to find a proper _collation_ to use. Which one are you currently using? – CBroe May 16 '22 at 12:34
  • The database is probably UTF-8, not UTF-16, but the question is valid (and interesting). – Pointy May 16 '22 at 12:35
  • Sorry my bad @Pointy is correct it's utf8_general_ci I will edit the question now – insecureabnormality May 16 '22 at 13:32
  • For MYSQL, use utf8mb4_general – Grumpy May 16 '22 at 13:34
  • @Grumpy Do you mean if I change the collation of the table or the database as a whole that this will solve the problem?? – insecureabnormality May 16 '22 at 14:44
  • @insecureabnormality MYSQL itself advices to use utf8mb4_general, im not sure if it will solve your problem – Grumpy May 16 '22 at 16:11
  • 1
    @insecureabnormality This wil solve your problem https://dev.mysql.com/doc/refman/8.0/en/charset-collate.html – Grumpy May 16 '22 at 16:12
  • @Grumpy so I've updated the query as you're saying `select character_id, name from characters order by trim('The ' from name) collate utf8mb4_general_ci` but now it's just throwing me an error saying `"COLLATION 'utf8mb4_general_ci' is not valid for CHARACTER SET 'utf8'`. I have also tried `select character_id, name collate utfmb4_general_ci as name1 from characters where is_del=0 order by trim('The ' from name1)` but it gives me the same error. I have a feeling this is definitely the right area to fix the problem but any further thoughts would be much appreciated. – insecureabnormality May 17 '22 at 13:17
  • My previous comment was deleted due to erroneous information - file was not updating on server – insecureabnormality May 17 '22 at 13:18
  • @insecureabnormality Change the character set of your table to utf8mb4_general_ci. More info here https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#:~:text=MySQL%20supports%20multiple%20Unicode%20character,you%20should%20use%20utfmb4%20instead. – Grumpy May 17 '22 at 14:05
  • Sorry for the delay in coming back, I've tried doing this and the problem still remains. Database, tables and columns are now all set to utf8mb4_general_ci and the character set is changed to utf8mb4 Any other thoughts/ideas? – insecureabnormality May 20 '22 at 10:23
  • Please provide the hex of the column with `Ériu` in it; there is no way (that I know of) for this to be misinterpreted as being between A and B. – Rick James May 23 '22 at 05:43
  • In Icelandic, `É` comes after `EZ` and before `F`. – Rick James May 23 '22 at 05:45
  • I would advise against using UTF-16 or ucs2 unless your external source provides such. Again, let's see the hex. – Rick James May 23 '22 at 05:46
  • `utf8mb4_general_ci` treats these as equal: E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě (Most other collations do likewise.) – Rick James May 23 '22 at 05:50
  • @Grumpy - "Change the character set of your table to utf8mb4_general_ci" -- That is a `COLLATION`, not a `CHARACTER SET`. – Rick James May 23 '22 at 05:52
  • "the character set is changed to" -- There are "right" ways and there are "wrong" ways to do that. What did you have before; what commands were used? (Again, I would like to see the hex.) – Rick James May 23 '22 at 05:54
  • @RickJames hi there so I used the following query `SELECT CONCAT('ALTER TABLE ', tbl.TABLE_SCHEMA, '.', tbl.TABLE_NAME, ' CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;') FROM information_schema.TABLES tbl WHERE tbl.TABLE_SCHEMA = 'database_name'` as per the following link https://ubiq.co/database-blog/mysql-change-collation-of-all-tables/ to generate a selection of other queries specific to each table and then proceeded to run them individually against the database and table in question. – insecureabnormality May 23 '22 at 17:13
  • I can roll back the database to before I made these changes if this process was incorrect – insecureabnormality May 23 '22 at 17:15
  • @RickJames I have edited the question to include the hex as requested. Thanks in advance. – insecureabnormality May 23 '22 at 21:14

1 Answers1

1

"Double encoding" seems to be the problem. I discuss this somewhat in Trouble with UTF-8 characters; what I see is not what I stored

Should `

41 65 6E 67 75 73 20 C383 E2809C 67

Óg is hex C393 67 in UTF-8.

Latin1 hex C3 93 67 is Óg

Repeat to get C383 E2809C 67

CONVERT(BINARY(CONVERT('Aengus Óg' USING latin1))
               USING utf8mb4) --> 'Aengus Óg'

This seems to be "double encoding":

CONVERT(BINARY(CONVERT(CONVERT(UNHEX('C383E280B0726975') USING utf8mb4) USING latin1)) USING utf8mb4) --> 'Ériu'

With Ériu as an intermediate step. This explains why it sorted with the A's.

This is a common problem. It often goes unnoticed because browsers "fix" the mess.

Experiment with SELECTs against the table. If the first one works for you, then it is just Mojibake.

SELECT CONVERT(BINARY(CONVERT(my_column USING latin1))
               USING utf8mb4)
    FROM ... WHERE ...;

Read that other Q&A to see what steps went wrong to cause the problem. It likely involves storing UTF-8 characters in a column declared latin1.

ALTER TABLE ... CONVERT TO ... assumes that the data is correctly stored. But it wasn't. Now you have the CHARACTER SET correctly set on the columns, but the data in it has been Mojibaked. So, it needs something like

UPDATE tbl  SET
    col1 = CONVERT(BINARY(CONVERT(col1 USING latin1))
           USING utf8mb4),
    col2 = CONVERT(BINARY(CONVERT(col2 USING latin1))
           USING utf8mb4),
    ...
    ;

More on the fix: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

Rollback? If you are more comfortable rolling back to before the CONVERT TO, then ignore most of what I said before, then you need the 2-step ALTER after the rollback. (See that blog link.)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Brilliant thanks for getting back to me I will try this over the afternoon and get back to you with results. – insecureabnormality May 24 '22 at 10:46
  • Apologies for the delay in coming back and marking this as correct but it took a bit of figuring out on my part. Needless to say that article you linked above is definitely the definitive resource for any problems like this that may be occurring. Thank you so much for the help. – insecureabnormality Jul 25 '22 at 22:11