Charset and Collation in MySql

Charset and Collation in MySql

A character set is a set of symbols and encodings. And a collation is a set of rules for comparing and sorting characters in a character set. A charset must contain at least 1 default collation. For Mysql 8.0 the default MySQL server character set and collation are utf8mb4 and utf8mb4_0900_ai_ci, but we can specify character sets at the server, database, table, column, and string literal levels.

All charset
All collation for a charset.

Now a bit of deep dive:

Mysql uses the weight_string function for sorting and comparing strings. This function returns the weight string for the input string. It has these properties:

Okay, so we know that the weight_string function is used for comparing, but what's the role of collation and charset yet? Let's understand with an example.

With latin1_general_cs

In the above example, we have taken a token A and a and applied charset latin1 and collation latin1_general_cs. We can see, that hex(weight_string()) are different and between A and a, A is smaller. Let's see another example with a different collation.

With latin1_general_ci

In the above example, we have used latin1_general_ci collation, and we can see that both A and a have the same weight and therefore both are equal.

Let's create 2 tables, with  latin1_general_ci and latin1_general_cs collation. We have inserted A and a in both the tables.

We already know that in the case of latin1_general_ci, both a and A have the same weight, let's see what happens when we query for only a. Well, no marks for guessing right. It will get both rows.

And, for latin1_general_cs, it will only fetch the row with the exact wight match. If we fetch the rows order by str_field, A will be before a.

_cs/_ci suffix in collation names means case-sensitive or case-insensitive, if you remember the default collation of MySql 8 is utf8mb4_0900_ai_ci which means it is case-insensitive and accent-insensitive.

SuffixMeaning
_aiAccent-insensitive
_asAccent-sensitive
_ciCase-insensitive
_csCase-sensitive
_ksKana-sensitive
_binBinary

Incompatible collation and charset:

In all the above examples, we have compared strings within the same charset and collation, but what happens when charset/collations are not compatible? Well, we get the Illegal mix of collations error. As a beginner, we all have struggled with this error and had no clue, what caused this, but now we know.

Illegal mix of collations
ERROR 1267 (HY000): Illegal mix of collations (latin1_general_ci,IMPLICIT) and (latin1_general_cs,IMPLICIT) for operation '='

In order to fix this error, we have to make columns compatible with each other, which we can do by making the collation the same across the columns.

Just putting it out there!

While writing this blog, I faced something weird, we already know that:

  • HEX(WEIGHT_STRING(_latin1'A' collate latin1_general_cs)) = 41
  • HEX(WEIGHT_STRING(_latin1'a' collate latin1_general_cs)) = 42  

But when assigning it to a variable @A_with_cs and @a_with_cs respectively, and checking its value, I was getting the same weight value? Weird right!.

Well, this could be because, at the system level in my database, the default collation is set to utf8_general_ci and with that collation, both A_with_cs and a_with_cs are the same.

Update: Changing system-level charset and collation has not solved the problem btw. ¯\_(ツ)_/¯

Conclusions:

A character set is a set of symbols and encodings. And a collation is a set of rules for comparing and sorting characters in a character set.

  • Collation and charset determine the string weight, which is used for comparing and sorting.
  • For Mysql 8.0 the default MySQL server character set and collation are utf8mb4 and utf8mb4_0900_ai_ci. In almost all cases, you should always use utf8mb4_0900_ai_ci collation.
  • Illegal mix of collations error can be fixed by making collations compatible with each other.

Resources:

MySQL :: MySQL 8.0 Reference Manual :: 12.8 String Functions and Operators
MySQL :: MySQL 8.0 Reference Manual :: 10 Character Sets, Collations, Unicode
MySQL Reference Manual
MySQL Localisation and International Usage The Character Set Used for Data and Sorting By default, MySQL uses the ISO-8859-1 (Latin1) character set with sorting according to Swedish/Finnish. This is the … - Selection from MySQL Reference Manual [Book]