If you work with MySQL (or MariaDB) and deal with UTF-8 encoded data, you will often encounter two very similar-looking collations: utf8_general_ci and utf8_unicode_ci. Although both support UTF-8, they behave differently when comparing and sorting text. These differences can directly affect search results, alphabetical ordering, user name matching, and multilingual behavior in your application.
What are "Character Set" and "Collation"?
Before comparing the two collations, it's important to understand the basics:
- Character Set (charset): Defines how characters are stored: which Unicode characters are supported and how many bytes each one uses.
- Collation: Defines the rules for comparing and sorting characters — whether case is considered, how accents are treated, and cultural sorting rules.
In MySQL, the utf8 charset stores UTF-8 encoded text (but supports only up to 3-byte characters). The collations, such as utf8_general_ci and utf8_unicode_ci, define how strings are evaluated when sorting or comparing.
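To see which collations a server actually offers, and which defaults apply to the current session, a quick check along these lines can help. This is only a sketch; the exact output depends on the MySQL or MariaDB version in use.

```sql
-- List the collations available for the 3-byte utf8 charset and for utf8mb4.
-- (Newer MySQL 8.0 releases report the 3-byte charset as utf8mb3, so both names are included.)
SHOW COLLATION WHERE Charset IN ('utf8', 'utf8mb3', 'utf8mb4');

-- Show the character set and collation defaults in effect for this session.
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
```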
Main Differences: utf8_general_ci vs utf8_unicode_ci
Here is a deeper, detailed comparison:
| Feature / Behavior | utf8_general_ci | utf8_unicode_ci |
|---|---|---|
| Sorting & Comparison Rules | Uses simpler, older comparison rules. Compares strings one character at a time with a simplified weight table; it does not support expansions, contractions, or ignorable characters. | Based on the Unicode Collation Algorithm (UCA, version 4.0.0), which supports expansions, contractions, and multi-character equivalences. |
| Accuracy for International / Accented Characters | Less accurate. Because comparisons are strictly one-to-one, "ß" compares equal to "s". | More accurate. Expansions are supported, so "ß" compares equal to "ss". Both collations are case- and accent-insensitive, so é, è, ê, ë compare equal to e in either one. |
| Performance | Faster. Because it uses simplified comparisons, it performs better on large datasets. | Slightly slower, since Unicode-compliant comparison requires more processing. |
| Recommended For | English-only applications, small websites, or places where perfect linguistic accuracy is not required. | Applications supporting multiple languages, user-generated content, global audience, or accurate linguistic sorting. |
Why the Difference Matters
If your website stores only English text, you may never notice any issues. However, when dealing with international names, city names, blog posts, or product titles in multiple languages, incorrect collation leads to:
- Incorrect alphabetical ordering
- Wrong search results (e.g., "Strasse" not matching "Straße" under utf8_general_ci)
- Accented characters being treated incorrectly
- User names not matching correctly
Example (German): Under utf8_general_ci, the character ß compares equal to "s", whereas utf8_unicode_ci treats it as equal to "ss", which matches German dictionary (DIN-1) ordering.
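The German example can be checked directly on a server. The sketch below uses the utf8mb4_ variants of the two collations, on the assumption that they follow the same comparison rules as their utf8_ counterparts and are available on the server being tested; the column aliases are made up for readability.

```sql
-- Make sure string literals are sent as utf8mb4 (assumes the client supports it).
SET NAMES utf8mb4;

-- general_ci: one-to-one comparison, so 'ß' equals 's' but not 'ss'.
SELECT 'ß' = 's'  COLLATE utf8mb4_general_ci AS eszett_eq_s,
       'ß' = 'ss' COLLATE utf8mb4_general_ci AS eszett_eq_ss;

-- unicode_ci: UCA expansions apply, so 'ß' equals 'ss' but not 's'.
SELECT 'ß' = 's'  COLLATE utf8mb4_unicode_ci AS eszett_eq_s,
       'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS eszett_eq_ss;
```

The same COLLATE clause can be attached to an ORDER BY or WHERE expression to try a different collation on a single query without altering the table.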
What Do Experts Recommend?
Database administrators and multilingual application developers generally recommend:
- Use Unicode-based collations (like utf8_unicode_ci) for more accurate sorting.
- Avoid utf8_general_ci if your application will ever handle multilingual content.
- Use utf8_general_ci only if your project is English-only and performance-sensitive.
The performance difference is minimal for typical websites; correctness usually matters more than a few milliseconds of speed.
Modern Recommendation: Use utf8mb4 + Unicode Collation
Modern MySQL documentation strongly recommends switching from utf8 to utf8mb4, because:
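- utf8 (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot store characters outside the Basic Multilingual Plane, such as most emojis (😄).
- utf8mb4 supports full UTF-8 with up to 4 bytes per character, which covers every Unicode character, including emojis, rare CJK ideographs, and other supplementary symbols.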
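The byte counts behind this limit are easy to confirm. The following sketch builds the characters from their UTF-8 bytes in hex, so it does not depend on how the client's connection charset is configured.

```sql
-- U+1F604 (😄) encoded in UTF-8 is the four bytes F0 9F 98 84.
SELECT CHAR_LENGTH(CONVERT(UNHEX('F09F9884') USING utf8mb4)) AS characters,  -- 1
       LENGTH(CONVERT(UNHEX('F09F9884') USING utf8mb4))      AS bytes;       -- 4

-- For comparison, 'é' (U+00E9) is C3 A9: one character, two bytes,
-- so it also fits within the 3-byte utf8 charset.
SELECT CHAR_LENGTH(CONVERT(UNHEX('C3A9') USING utf8mb4)) AS characters,      -- 1
       LENGTH(CONVERT(UNHEX('C3A9') USING utf8mb4))      AS bytes;           -- 2
```

CHAR_LENGTH counts characters and LENGTH counts bytes, which is why the emoji reports one character but four bytes.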
The recommended modern collations are:
- utf8mb4_unicode_ci: widely used, accurate Unicode sorting (UCA 4.0.0)
- utf8mb4_unicode_520_ci: based on newer Unicode rules (UCA 5.2.0)
- utf8mb4_0900_ai_ci: the most recent of the three (MySQL 8+, where it is the default for utf8mb4), based on UCA 9.0.0
How to Set Charset & Collation (MySQL / PHP / HTML)
When creating a database:
CREATE DATABASE my_blog_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Creating a table:
CREATE TABLE posts ( id INT PRIMARY KEY AUTO_INCREMENT, title VARCHAR(255), content TEXT ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
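The connection between the application and the server has its own character set, so it is worth setting that as well and then checking what the table actually received. A minimal sketch, reusing the my_blog_db and posts names from the statements above:

```sql
-- Ask the server to use utf8mb4 for statements sent and results returned on this connection.
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Confirm the table definition, including its default charset and collation.
SHOW CREATE TABLE posts;

-- Per-column view of the charset and collation in use.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'my_blog_db' AND TABLE_NAME = 'posts';
```

In PHP, the connection charset is usually set through the driver rather than with a raw SET NAMES statement, for example with mysqli's set_charset('utf8mb4') or the charset option in a PDO DSN.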
In your HTML pages:
<meta charset="UTF-8">
Conclusion — Which Should You Choose?
Here's the simple rule:
- English-only website: utf8_general_ci is okay, slightly faster.
- Multilingual content or international users: utf8_unicode_ci or, better, utf8mb4_unicode_ci.
- New applications (recommended): Always use utf8mb4 + Unicode collation.
Choosing the right charset and collation early on saves you from future bugs, incorrect sorting, and complex migrations.
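For an existing database that already uses utf8, a migration could look roughly like the sketch below. It reuses the my_blog_db and posts names from earlier and assumes a backup has been taken and the change has been tried on a copy first; converting large tables can take time, and indexed text columns may run into index length limits on older setups.

```sql
-- Change the database default so that new tables pick up utf8mb4.
ALTER DATABASE my_blog_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Convert an existing table, including its text columns, in place.
ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```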
