Why use utf8_unicode_ci over utf8_general_ci, or simply: why you should never use utf8_general_ci.

Basically, utf8_general_ci is a broken version of utf8_unicode_ci. It is faster, but only slightly, and it can produce unexpected results when sorting or comparing strings. So why would you want to use a broken collation? There are, however, better alternatives to _unicode_ci, for example _0900_ai_ci. These collations are responsible for sorting and comparing characters. If you want your encoding to be accurate, don’t forget to use utf8mb4 instead of regular utf8 so that you get the correct character encoding (the standard utf8 is being deprecated by MySQL and replaced with utf8mb4).
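To make the utf8 versus utf8mb4 point concrete: MySQL's legacy utf8 (also called utf8mb3) stores at most 3 bytes per character, while real UTF-8 needs up to 4. The short Python sketch below only counts UTF-8 bytes per character; it does not talk to MySQL, and the helper names `utf8_byte_length` and `fits_in_utf8mb3` are made up for this illustration.

```python
def utf8_byte_length(ch: str) -> int:
    """Return how many bytes the given text needs in standard UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(text: str) -> bool:
    """True if every character needs at most 3 bytes.

    Hypothetical helper: it mirrors the 3-byte-per-character limit of
    MySQL's legacy utf8 (utf8mb3) but is not part of any MySQL API.
    """
    return all(utf8_byte_length(ch) <= 3 for ch in text)


# A plain letter, a Polish letter, the euro sign and an emoji.
samples = ["L", "Ł", "€", "😀"]
for ch in samples:
    print(ch, utf8_byte_length(ch), fits_in_utf8mb3(ch))
```

Only the emoji needs 4 bytes here, so it is the kind of character that the old utf8 cannot store and utf8mb4 can.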

For those people still arriving at this question in 2020 or later, there are newer options that may be better than both of these. For example, utf8mb4_0900_ai_ci.

All these collations are for the UTF-8 character encoding. The differences are in how text is sorted and compared.

_unicode_ci and _general_ci are two different sets of rules for sorting and comparing text according to the way we expect. Newer versions of MySQL introduce new sets of rules, too, such as _0900_ai_ci for equivalent rules based on Unicode 9.0 – and with no equivalent _general_ci variant. People reading this now should probably use one of these newer collations instead of either _unicode_ci or _general_ci. The description of those older collations below is provided for interest only.

MySQL is currently transitioning away from an older, flawed UTF-8 implementation. For now, you need to use utf8mb4 instead of utf8 for the character encoding part, to ensure you are getting the fixed version. The flawed version remains for backward compatibility, though it is being deprecated.

Source of the quoted passage above: StackOverflow answer by user thomasrutter.

There is also a discussion about the difference between the utf8_unicode_ci and utf8_unicode_520_ci collations, which sort and compare text differently. For example, in utf8_unicode_ci (the same as in utf8mb4_polish_ci) L < Ł < M and Ł != L, whereas in utf8_unicode_520_ci Ł == L.
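One way to picture what "different rules for sorting and comparing" means is to treat a collation as a function that turns each string into a sort key. The Python sketch below is a toy model only: real collations follow the Unicode Collation Algorithm with multi-level weights, and nothing here is MySQL code. The weight tables and the names `sort_key` and `collation_equal` are assumptions made for the illustration.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Rule set A: "ł" is a separate letter placed between "l" and "m".
WEIGHTS_DISTINCT = {ch: float(i) for i, ch in enumerate(ALPHABET)}
WEIGHTS_DISTINCT["ł"] = WEIGHTS_DISTINCT["l"] + 0.5

# Rule set B: "ł" gets exactly the same weight as "l".
WEIGHTS_MERGED = dict(WEIGHTS_DISTINCT)
WEIGHTS_MERGED["ł"] = WEIGHTS_MERGED["l"]


def sort_key(text, weights):
    """Case-insensitive key: lowercase, then map each letter to its weight.

    Raises KeyError for characters outside the toy alphabet.
    """
    return tuple(weights[ch] for ch in text.lower())


def collation_equal(a, b, weights):
    """Two strings are 'equal' under a rule set if their keys match."""
    return sort_key(a, weights) == sort_key(b, weights)


words = ["Mak", "Łak", "Lak"]
print(sorted(words, key=lambda w: sort_key(w, WEIGHTS_DISTINCT)))
print(collation_equal("Ł", "L", WEIGHTS_DISTINCT))  # distinct letters
print(collation_equal("Ł", "L", WEIGHTS_MERGED))    # same weight
```

Under rule set A the words sort as Lak, Łak, Mak and Ł is not equal to L; under rule set B the two letters compare as equal. That is the same kind of difference described above, in miniature.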

Which one you use, and why, is a matter of use case and preference, so just know that there are several options and any of them can be a good choice depending on your needs.

