====== MySQL and UTF-8 ======
===== Notes =====
Good support from 4.1
**utf-8** is **utf8** in MySQL.
A collation defines the sort order for the data, it may be case sensitive or not
To find out your current setup:
SHOW VARIABLES LIKE 'character_set_database';
SHOW VARIABLES LIKE 'character_set_client';
To see available character sets and collations on your database:
SHOW CHARACTER SET;
SHOW COLLATION LIKE 'utf8%';
Character set and collation can be set per server, database, table, connection;
Server (**/etc/my.cnf**):
[mysqld]
...
default-character-set=utf8
default-collation=utf8_general_ci
Database:
(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8
Table:
(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8
Connection:
SET NAMES 'utf8';
A PHP mysql connection (//not totally confirmed//, but see tests below) defaults to a latin1 connection, so, your first query after connection should be:
mysql_query("SET NAMES 'utf8'");
In php versions 5.2 and later, use
mysql_set_charset('utf8',$conn);
The CONVERT() function can convert between charsets, eg:
INSERT INTO utf8table (utf8column)
SELECT CONVERT(latin1field USING utf8)
FROM latin1table;
As mentioned in [[php:118n:charsets]], field widths may need to be increased to deal with multi-byte characters.
Code to generate a mass change of collations:
';
while ($row_tables = mysql_fetch_row($rs_tables)) {
$table = mysql_real_escape_string($row_tables[0]);
// Alter table collation
// ALTER TABLE `account` DEFAULT CHARACTER SET utf8
if ($show_alter_table) {
echo("ALTER TABLE `$table` DEFAULT CHARACTER SET $character_set;\r\n");
}
$rs = mysql_query(" SHOW FULL FIELDS FROM `$table` ") or die(mysql_error());
while ($row=mysql_fetch_assoc($rs)) {
if ($row['Collation']!=$convert_from)
continue;
// Is the field allowed to be null?
if ($row['Null']=='YES') {
$nullable = ' NULL ';
} else {
$nullable = ' NOT NULL';
}
// Does the field default to null, a string, or nothing?
if ($row['Default']==NULL) {
$default = " DEFAULT NULL";
} else if ($row['Default']!='') {
$default = " DEFAULT '".mysql_real_escape_string($row['Default'])."'";
} else {
$default = '';
}
// Alter field collation:
// ALTER TABLE `account` CHANGE `email` `email` VARCHAR( 50 ) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL
if ($show_alter_field) {
$field = mysql_real_escape_string($row['Field']);
echo "ALTER TABLE `$table` CHANGE `$field` `$field` $row[Type] CHARACTER SET $character_set COLLATE $convert_to $nullable $default; \r\n";
}
}
}
?>
===== MySQL tables, columns and connections with UTF-8 =====
What happens when you ''INSERT'' a UTF-8 string into a MySQL database using PHP's
[[http://php.net/mysql|MySQL extension]]? Well that depends on what string you use. If you use something like "Iñtërnâtiônàlizætiøn" then doing just
about anything in MySQL will be data-safe (but, of course, the collation will be incorrect). If you have characters
that don't encode the same in UTF-8 and latin1 (e.g. text in Chinese, Russian, ...) then the behaviour depends on **both**
the table definition and the encoding of the connection to the database.
| ^ Table or column charset ^^
| ^ latin1 (MySQL default) ^ utf8 ^
^ default connection ''mysql_pconnect()'' |data is binary-safe but not encoded properly* |data is binary-safe but not encoded properly* |
^ using ''SET NAMES 'utf8';'' |data destroyed; string is converted to "????" |works fine! |
* It seems that MySQL will look at the data as being a series of bytes all within the latin1 codepage. If you use the same code to read and write the data then it will round-trip fine (but MySQL's collation will obviously be wrong). If you use tools with a knowledge of the data types used by MySQL, like [[http://phpmyadmin.sf.net|phpMyAdmin]] or even mysql_dump, then the wrong encoding is obvious.
(this test was performed with PHP versions 4.4.2 and 5.1.2 and MySQL 5.0.20; [[http://pastebin.com/751239|PHP source for test]])
===== Further Reading =====
* [[http://blog.stuartherbert.com/gentoo.php/2006/01/11/fun_with_collations|Fun with Collations]]
* [[http://dev.mysql.com/doc/mysql/en/charset-general.html|MySQL documentation]]
* [[http://www.bluetwanger.de/blog/2006/11/20/mysql-and-utf-8-no-more-question-marks/|MySQL and UTF-8 - no more question marks]]