1

MYSQL databases

Database1 - table - table1
id - int
title - utf8_general_ci
values -
1, Bienvenue Chez les Ch’tis
2, ABC
3, XYZ

Database2 - table - table2
id - int
title - utf8_unicode_ci
values -
1, Bienvenue Chez les Ch’tis
2, ABC
3, QWE

I have list of titles in both the tables as above.I want to list titles which are not same.

I tries so many different things but no luck.

How can I compare these titles?

SELECT database1.table1.title as title1,database2.table2.title as title2 
FROM database1.table1 JOIN database2.table2 ON (database1.table1.id =database2.table2)
WHERE database1.table1.title NOT LIKE database2.table2.title COLLATE utf8_general_ci

result from above query:
Bienvenue Chez les Ch’tis, Bienvenue Chez les Ch’tis
XYZ, QWE

But I just want below in result
XYZ, QWE

I want to exclude french one, I want to consider it as a identical titles.

David Lee
  • 2,040
  • 17
  • 36
JSoni
  • 13
  • 3
  • can you change db1 to utf8_unicode_ci ? –  Jun 27 '17 at 21:23
  • Those characters look malformed. Both collations you mentioned are unicode so that shouldn't happen. I don't think its a collation issue. – apokryfos Jun 27 '17 at 21:26
  • The collation specifies both character encoding and sorting. You only need to be concerned about the character encoding, and these are the same. The two titles you show *are* different, though that difference may be a data error. – rd_nielsen Jun 27 '17 at 21:28
  • Thanks for all comments. I will look into this. – JSoni Jun 28 '17 at 14:35

1 Answers1

1

’ is Mojibake for . The problem came when storing the same right-single-quote in two different ways. Not a COLLATION issue.

First, check the two tables to see what got stored:

SELECT HEX(title) FROM database1.table1 WHERE id = 1

Two likely cases of what you will find are these:

C  h          ...        t  i  s
43 68        E28099      74 69 73  -- correctly encoded with UTF-8
43 68 C3A2 E282AC E284A2 74 69 73  -- "double encoded"

(I added spacing to the HEX for clarity.)

If one of the tables is "doubly encoded", then it got messed up during INSERTing. The strings will not compare equal.

This Q&A discusses things further. Then see this for fixing the doubly encoded text. But be sure to apply it only to the broken data.

Rick James
  • 135,179
  • 13
  • 127
  • 222