Charset and Collation in MySql

A character set is a set of symbols and encodings. And a collation
is a set of rules for comparing and sorting characters in a character set. A charset must contain at least 1 default collation. For Mysql 8.0 the default MySQL server character set and collation are utf8mb4 and utf8mb4_0900_ai_ci, but we can specify character sets at the server, database, table, column, and string literal levels.


Now a bit of deep dive:
Mysql uses the weight_string
function for sorting and comparing strings. This function returns the weight string for the input string. It has these properties:
- If
WEIGHT_STRING(
=str1
)WEIGHT_STRING(
, thenstr2
)
(str1
=str2
str1
andstr2
are considered equal) - If
WEIGHT_STRING(
<str1
)WEIGHT_STRING(
, thenstr2
)
(str1
<str2
str1
sorts beforestr2
)
Okay, so we know that the weight_string
function is used for comparing, but what's the role of collation and charset yet? Let's understand with an example.

In the above example, we have taken a token A
and a
and applied charset latin1 and collation latin1_general_cs. We can see, that hex(weight_string())
are different and between A
and a
, A
is smaller. Let's see another example with a different collation.

In the above example, we have used latin1_general_ci collation, and we can see that both A
and a
have the same weight and therefore both are equal.
Let's create 2 tables, with latin1_general_ci and latin1_general_cs collation. We have inserted A
and a
in both the tables.

We already know that in the case of latin1_general_ci, both a
and A
have the same weight, let's see what happens when we query for only a
. Well, no marks for guessing right. It will get both rows.

And, for latin1_general_cs, it will only fetch the row with the exact wight match. If we fetch the rows order by str_field, A
will be before a
.

_cs/_ci
suffix in collation names means case-sensitive or case-insensitive, if you remember the default collation of MySql 8 is utf8mb4_0900_ai_ci which means it is case-insensitive and accent-insensitive.
Suffix | Meaning |
---|---|
_ai | Accent-insensitive |
_as | Accent-sensitive |
_ci | Case-insensitive |
_cs | Case-sensitive |
_ks | Kana-sensitive |
_bin | Binary |
Incompatible collation and charset:
In all the above examples, we have compared strings within the same charset and collation, but what happens when charset/collations are not compatible? Well, we get the Illegal mix of collations
error. As a beginner, we all have struggled with this error and had no clue, what caused this, but now we know.

ERROR 1267 (HY000): Illegal mix of collations (latin1_general_ci,IMPLICIT) and (latin1_general_cs,IMPLICIT) for operation '='
In order to fix this error, we have to make columns compatible with each other, which we can do by making the collation the same across the columns.
Just putting it out there!
While writing this blog, I faced something weird, we already know that:
HEX(WEIGHT_STRING(_latin1'A' collate latin1_general_cs))
= 41HEX(WEIGHT_STRING(_latin1'a' collate latin1_general_cs))
= 42
But when assigning it to a variable @A_with_cs
and @a_with_cs
respectively, and checking its value, I was getting the same weight value? Weird right!.

Well, this could be because, at the system level in my database, the default collation is set to utf8_general_ci
and with that collation, both A_with_cs
and a_with_cs
are the same.

Update: Changing system-level charset and collation has not solved the problem btw. ¯\_(ツ)_/¯
Conclusions:
A character set is a set of symbols and encodings. And a collation
is a set of rules for comparing and sorting characters in a character set.
- Collation and charset determine the string weight, which is used for comparing and sorting.
- For Mysql 8.0 the default MySQL server character set and collation are utf8mb4 and utf8mb4_0900_ai_ci. In almost all cases, you should always use
utf8mb4_0900_ai_ci
collation. - Illegal mix of collations error can be fixed by making collations compatible with each other.