What effects does using a binary collation have?

Question

While answering this question, I became uncertain about something that I didn't manage to find a sufficient answer to.

What are the practical differences between using the binary utf8_bin and the case insensitive utf8_general_ci collations?

I can see three:

Both have a different sorting order; _bin's sorting order is likely to put any umlauts to the end of the alphabet, because byte values are compared (right?)
Only case sensitive searches in _bin
No A = Ä equality in _bin

Are there any other differences or side-effects to be aware of?

Reference:

9.1.2. Character Sets and Collations in MySQL
9.1.7.6. The _bin and binary Collations in the mySQL manual
9.1.7.7. The BINARY Operator

Similar questions that don't address the issue:

UTF-8: General? Bin? Unicode?

score 75 · Accepted Answer · edited Oct 17 '20 at 07:59

75

Binary collation compares your string exactly as strcmp() in C would do, if characters are different (be it just case or diacritics difference). The downside of it that the sort order is not natural.

An example of unnatural sort order (as in "binary" is) : A,B,a,b Natural sort order would be in this case e.g : A,a,B,b (small and capital variations of the same letter are sorted next to each other)

The practical advantage of binary collation is its speed, as string comparison is very simple/fast. In general case, indexes with binary might not produce expected results for sort, however for exact matches they can be useful.

edited Oct 17 '20 at 07:59

ashleedawg

20,365
9
72
105

answered Apr 04 '11 at 01:30

Vladislav Vaintroub

5,308
25
31

Thanks for the examples, they illustrate the issue well. – Pekka May 04 '11 at 07:55
What in case, I need case-sensitive comparison involving pretty big tables (in terms of data volume ~ above 4 million record). I recently encountered this issue, with utf_8_unicode_ci collation, my query does good in terms of performance but does not produce exact matches. If I change collation to utf8_bin, it just gets dead in terms of performance.... a very big difference in terms of performance – shashi009 Apr 14 '16 at 10:13
3

Small correction of the above.. It *almost* works as strcmp(), but not 100% exactly. According to SQL standard, string with different length needs to be compared as if shorter string is padded with space character(0x20) up to the length of the larger string. In most cases, this would behave like strcmp(), but in degenerate cases involving bytes < 0x20 (space) it won't. For example 'a\t' ('a' followed by TAB) is less than 'a ' (a followed by space), thus 'a\t' < 'a' . MariaDB 10.2 introduced NOPAD collations, those match strcmp() behavior exactly, (https://jira.mariadb.org/browse/MDEV-9711) – Vladislav Vaintroub Oct 21 '16 at 13:32

score 20 · Answer 2 · edited Jun 08 '15 at 04:30

utf8_bin: Compares strings by the binary value of each character in the string.

utf8_general_ci: Compares strings using general language rules and using case-insensitive comparisons.

utf8_general_cs: Compares strings using general language rules and using case-sensitive comparisons.

For example, the following will evaluate at true with either of the UTF8_general collations, but not with the utf8_bin collation:

Ä = A Ö = O Ü = U

With the utf8_general_ci collation, they would also return true even if not the same case. http://www.phpbuilder.com/board/showpost.php?s=2e642ac7dc5fceca2dbca1e2b9c424fd&p=10820221&postcount=2

Is *_cs faster than *_ci ? for English/ASCII strings – Yousha Aleayoub May 17 '17 at 10:29 — Yousha Aleayoub, May 17 '17 at 10:29

score 2 · Answer 3 · answered May 07 '11 at 13:36

The other answers explain the differences well.

Binary collation can be useful in some cases :

column contains hexadecimal data like password hashes
you are only interested in exact matches, not sorting
for identifiers with only [a-z0-9_] characters, you can even use it for sorting
for some reason you store numbers in CHAR() or VARCHAR columns (like telephones)
zipcodes
UUIDs
etc

In all those cases you can save a (little) bit of cpu cycles with a binary collation.

What type of binary? ascii_bin, utf8_bin, latin1_bin etc... ? — Yousha Aleayoub, May 17 '17 at 10:30

score 1 · Answer 4 · answered May 04 '11 at 22:05

1

With utf8_general_ci, matches occur without taking case and accentuation into account. It may be a good thing when you need to perform queries on words.

In utf8_bin, the match only occurs when strings are strictly the same. Queries are faster this way.

answered May 04 '11 at 22:05

Mathieu Rodic

6,637
2
43
49

Is utf8_bin faster than latin1_bin or ascii_bin ? – Yousha Aleayoub May 17 '17 at 10:30
2

No, but **_bin** should be faster than **_ci**, as case insensitive quering (or indexing) involves strings transformations. The best way to know how faster is to test against databases containing many entries. – Mathieu Rodic May 23 '17 at 15:38

What effects does using a binary collation have?

4 Answers4

Linked

Related